ttok


Namettok JSON
Version 0.3 PyPI version JSON
download
home_pagehttps://github.com/simonw/ttok
SummaryCount and truncate text based on tokens
upload_time2024-05-02 23:39:08
maintainerNone
docs_urlNone
authorSimon Willison
requires_python>=3.8
licenseApache License, Version 2.0
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # ttok

[![PyPI](https://img.shields.io/pypi/v/ttok.svg)](https://pypi.org/project/ttok/)
[![Changelog](https://img.shields.io/github/v/release/simonw/ttok?include_prereleases&label=changelog)](https://github.com/simonw/ttok/releases)
[![Tests](https://github.com/simonw/ttok/workflows/Test/badge.svg)](https://github.com/simonw/ttok/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/ttok/blob/master/LICENSE)

Count and truncate text based on tokens

## Background

Large language models such as GPT-3.5 and GPT-4 work in terms of tokens.

This tool can count tokens, using OpenAI's [tiktoken](https://github.com/openai/tiktoken) library.

It can also truncate text to a specified number of tokens.

See [llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.

## Installation

Install this tool using `pip`:
```bash
pip install ttok
```
Or using Homebrew:
```bash
brew install simonw/llm/ttok
```

## Counting tokens

Provide text as arguments to this tool to count tokens:

```bash
ttok Hello world
```
```
2
```
You can also pipe text into the tool:
```bash
echo -n "Hello world" | ttok
```
```
2
```
Here the `echo -n` option prevents echo from adding a newline - without that you would get a token count of 3.

To pipe in text and then append extra tokens from arguments, use the `-i -` option:

```bash
echo -n "Hello world" | ttok more text -i -
```
```
6
```
## Different models

By default, the tokenizer model for GPT-3.5 and GPT-4 is used.

To use the model for GPT-2 and GPT-3, add `--model gpt2`:

```bash
ttok boo Hello there this is -m gpt2
```
```
6
```
Compared to GPT-3.5:
```bash
ttok boo Hello there this is
```
```
5
```
Further model options are [documented here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).

## Truncating text

Use the `-t 10` or `--truncate 10` option to truncate text to a specified number of tokens:

```bash
ttok This is too many tokens -t 3
```
```
This is too
```

## Viewing tokens

The `--encode` option can be used to view the integer token IDs for the incoming text:

```bash
ttok Hello world --encode
```
```
9906 1917
```
The `--decode` method reverses this process:

```bash
ttok 9906 1917 --decode
```
```
Hello world
```
Add `--tokens` to either of these options to see a detailed breakdown of the tokens:

```bash
ttok Hello world --encode --tokens
```
```
[b'Hello', b' world']
```

## Available models

This is the full list of available models and their corresponding encodings. Model names and encoding names are valid for the `-m/--model` option.

<!-- [[[cog
import cog
import tiktoken
output = []
for key, value in tiktoken.model.MODEL_TO_ENCODING.items():
    output.append("- `{}` (`{}`)".format(key, value))
cog.out("\n".join(output))
]]] -->
- `gpt-4` (`cl100k_base`)
- `gpt-3.5-turbo` (`cl100k_base`)
- `gpt-3.5` (`cl100k_base`)
- `gpt-35-turbo` (`cl100k_base`)
- `davinci-002` (`cl100k_base`)
- `babbage-002` (`cl100k_base`)
- `text-embedding-ada-002` (`cl100k_base`)
- `text-embedding-3-small` (`cl100k_base`)
- `text-embedding-3-large` (`cl100k_base`)
- `text-davinci-003` (`p50k_base`)
- `text-davinci-002` (`p50k_base`)
- `text-davinci-001` (`r50k_base`)
- `text-curie-001` (`r50k_base`)
- `text-babbage-001` (`r50k_base`)
- `text-ada-001` (`r50k_base`)
- `davinci` (`r50k_base`)
- `curie` (`r50k_base`)
- `babbage` (`r50k_base`)
- `ada` (`r50k_base`)
- `code-davinci-002` (`p50k_base`)
- `code-davinci-001` (`p50k_base`)
- `code-cushman-002` (`p50k_base`)
- `code-cushman-001` (`p50k_base`)
- `davinci-codex` (`p50k_base`)
- `cushman-codex` (`p50k_base`)
- `text-davinci-edit-001` (`p50k_edit`)
- `code-davinci-edit-001` (`p50k_edit`)
- `text-similarity-davinci-001` (`r50k_base`)
- `text-similarity-curie-001` (`r50k_base`)
- `text-similarity-babbage-001` (`r50k_base`)
- `text-similarity-ada-001` (`r50k_base`)
- `text-search-davinci-doc-001` (`r50k_base`)
- `text-search-curie-doc-001` (`r50k_base`)
- `text-search-babbage-doc-001` (`r50k_base`)
- `text-search-ada-doc-001` (`r50k_base`)
- `code-search-babbage-code-001` (`r50k_base`)
- `code-search-ada-code-001` (`r50k_base`)
- `gpt2` (`gpt2`)
- `gpt-2` (`gpt2`)
<!-- [[[end]]] -->

## ttok --help

<!-- [[[cog
from ttok import cli
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(cli.cli, ["--help"])
help = result.output.replace("Usage: cli", "Usage: ttok")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: ttok [OPTIONS] [PROMPT]...

  Count and truncate text based on tokens

  To count tokens for text passed as arguments:

      ttok one two three

  To count tokens from stdin:

      cat input.txt | ttok

  To truncate to 100 tokens:

      cat input.txt | ttok -t 100

  To truncate to 100 tokens using the gpt2 model:

      cat input.txt | ttok -t 100 -m gpt2

  To view token integers:

      cat input.txt | ttok --encode

  To convert tokens back to text:

      ttok 9906 1917 --decode

  To see the details of the tokens:

      ttok "hello world" --tokens

  Outputs:

      [b'hello', b' world']

Options:
  --version               Show the version and exit.
  -i, --input FILENAME
  -t, --truncate INTEGER  Truncate to this many tokens
  -m, --model TEXT        Which model to use
  --encode, --tokens      Output token integers
  --decode                Convert token integers to text
  --tokens                Output full tokens
  --allow-special         Do not error on special tokens
  --help                  Show this message and exit.

```
<!-- [[[end]]] -->

You can also run this command using:

```bash
python -m ttok --help
```

## Development

To contribute to this tool, first checkout the code. Then create a new virtual environment:

```bash
cd ttok
python -m venv venv
source venv/bin/activate
```

Now install the dependencies and test dependencies:

```bash
pip install -e '.[test]'
```

To run the tests:

```bash
pytest
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/simonw/ttok",
    "name": "ttok",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Simon Willison",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/d3/d9/2751bb9b3f53206d5a0dd062cee6ca3a9dfbc30f5c8f43d6684a0c0cb810/ttok-0.3.tar.gz",
    "platform": null,
    "description": "# ttok\n\n[![PyPI](https://img.shields.io/pypi/v/ttok.svg)](https://pypi.org/project/ttok/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/ttok?include_prereleases&label=changelog)](https://github.com/simonw/ttok/releases)\n[![Tests](https://github.com/simonw/ttok/workflows/Test/badge.svg)](https://github.com/simonw/ttok/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/ttok/blob/master/LICENSE)\n\nCount and truncate text based on tokens\n\n## Background\n\nLarge language models such as GPT-3.5 and GPT-4 work in terms of tokens.\n\nThis tool can count tokens, using OpenAI's [tiktoken](https://github.com/openai/tiktoken) library.\n\nIt can also truncate text to a specified number of tokens.\n\nSee [llm, ttok and strip-tags\u2014CLI tools for working with ChatGPT and other LLMs](https://simonwillison.net/2023/May/18/cli-tools-for-llms/) for more on this project.\n\n## Installation\n\nInstall this tool using `pip`:\n```bash\npip install ttok\n```\nOr using Homebrew:\n```bash\nbrew install simonw/llm/ttok\n```\n\n## Counting tokens\n\nProvide text as arguments to this tool to count tokens:\n\n```bash\nttok Hello world\n```\n```\n2\n```\nYou can also pipe text into the tool:\n```bash\necho -n \"Hello world\" | ttok\n```\n```\n2\n```\nHere the `echo -n` option prevents echo from adding a newline - without that you would get a token count of 3.\n\nTo pipe in text and then append extra tokens from arguments, use the `-i -` option:\n\n```bash\necho -n \"Hello world\" | ttok more text -i -\n```\n```\n6\n```\n## Different models\n\nBy default, the tokenizer model for GPT-3.5 and GPT-4 is used.\n\nTo use the model for GPT-2 and GPT-3, add `--model gpt2`:\n\n```bash\nttok boo Hello there this is -m gpt2\n```\n```\n6\n```\nCompared to GPT-3.5:\n```bash\nttok boo Hello there this is\n```\n```\n5\n```\nFurther model options are [documented here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).\n\n## Truncating text\n\nUse the `-t 10` or `--truncate 10` option to truncate text to a specified number of tokens:\n\n```bash\nttok This is too many tokens -t 3\n```\n```\nThis is too\n```\n\n## Viewing tokens\n\nThe `--encode` option can be used to view the integer token IDs for the incoming text:\n\n```bash\nttok Hello world --encode\n```\n```\n9906 1917\n```\nThe `--decode` method reverses this process:\n\n```bash\nttok 9906 1917 --decode\n```\n```\nHello world\n```\nAdd `--tokens` to either of these options to see a detailed breakdown of the tokens:\n\n```bash\nttok Hello world --encode --tokens\n```\n```\n[b'Hello', b' world']\n```\n\n## Available models\n\nThis is the full list of available models and their corresponding encodings. Model names and encoding names are valid for the `-m/--model` option.\n\n<!-- [[[cog\nimport cog\nimport tiktoken\noutput = []\nfor key, value in tiktoken.model.MODEL_TO_ENCODING.items():\n    output.append(\"- `{}` (`{}`)\".format(key, value))\ncog.out(\"\\n\".join(output))\n]]] -->\n- `gpt-4` (`cl100k_base`)\n- `gpt-3.5-turbo` (`cl100k_base`)\n- `gpt-3.5` (`cl100k_base`)\n- `gpt-35-turbo` (`cl100k_base`)\n- `davinci-002` (`cl100k_base`)\n- `babbage-002` (`cl100k_base`)\n- `text-embedding-ada-002` (`cl100k_base`)\n- `text-embedding-3-small` (`cl100k_base`)\n- `text-embedding-3-large` (`cl100k_base`)\n- `text-davinci-003` (`p50k_base`)\n- `text-davinci-002` (`p50k_base`)\n- `text-davinci-001` (`r50k_base`)\n- `text-curie-001` (`r50k_base`)\n- `text-babbage-001` (`r50k_base`)\n- `text-ada-001` (`r50k_base`)\n- `davinci` (`r50k_base`)\n- `curie` (`r50k_base`)\n- `babbage` (`r50k_base`)\n- `ada` (`r50k_base`)\n- `code-davinci-002` (`p50k_base`)\n- `code-davinci-001` (`p50k_base`)\n- `code-cushman-002` (`p50k_base`)\n- `code-cushman-001` (`p50k_base`)\n- `davinci-codex` (`p50k_base`)\n- `cushman-codex` (`p50k_base`)\n- `text-davinci-edit-001` (`p50k_edit`)\n- `code-davinci-edit-001` (`p50k_edit`)\n- `text-similarity-davinci-001` (`r50k_base`)\n- `text-similarity-curie-001` (`r50k_base`)\n- `text-similarity-babbage-001` (`r50k_base`)\n- `text-similarity-ada-001` (`r50k_base`)\n- `text-search-davinci-doc-001` (`r50k_base`)\n- `text-search-curie-doc-001` (`r50k_base`)\n- `text-search-babbage-doc-001` (`r50k_base`)\n- `text-search-ada-doc-001` (`r50k_base`)\n- `code-search-babbage-code-001` (`r50k_base`)\n- `code-search-ada-code-001` (`r50k_base`)\n- `gpt2` (`gpt2`)\n- `gpt-2` (`gpt2`)\n<!-- [[[end]]] -->\n\n## ttok --help\n\n<!-- [[[cog\nfrom ttok import cli\nfrom click.testing import CliRunner\nrunner = CliRunner()\nresult = runner.invoke(cli.cli, [\"--help\"])\nhelp = result.output.replace(\"Usage: cli\", \"Usage: ttok\")\ncog.out(\n    \"```\\n{}\\n```\".format(help)\n)\n]]] -->\n```\nUsage: ttok [OPTIONS] [PROMPT]...\n\n  Count and truncate text based on tokens\n\n  To count tokens for text passed as arguments:\n\n      ttok one two three\n\n  To count tokens from stdin:\n\n      cat input.txt | ttok\n\n  To truncate to 100 tokens:\n\n      cat input.txt | ttok -t 100\n\n  To truncate to 100 tokens using the gpt2 model:\n\n      cat input.txt | ttok -t 100 -m gpt2\n\n  To view token integers:\n\n      cat input.txt | ttok --encode\n\n  To convert tokens back to text:\n\n      ttok 9906 1917 --decode\n\n  To see the details of the tokens:\n\n      ttok \"hello world\" --tokens\n\n  Outputs:\n\n      [b'hello', b' world']\n\nOptions:\n  --version               Show the version and exit.\n  -i, --input FILENAME\n  -t, --truncate INTEGER  Truncate to this many tokens\n  -m, --model TEXT        Which model to use\n  --encode, --tokens      Output token integers\n  --decode                Convert token integers to text\n  --tokens                Output full tokens\n  --allow-special         Do not error on special tokens\n  --help                  Show this message and exit.\n\n```\n<!-- [[[end]]] -->\n\nYou can also run this command using:\n\n```bash\npython -m ttok --help\n```\n\n## Development\n\nTo contribute to this tool, first checkout the code. Then create a new virtual environment:\n\n```bash\ncd ttok\npython -m venv venv\nsource venv/bin/activate\n```\n\nNow install the dependencies and test dependencies:\n\n```bash\npip install -e '.[test]'\n```\n\nTo run the tests:\n\n```bash\npytest\n```\n",
    "bugtrack_url": null,
    "license": "Apache License, Version 2.0",
    "summary": "Count and truncate text based on tokens",
    "version": "0.3",
    "project_urls": {
        "CI": "https://github.com/simonw/ttok/actions",
        "Changelog": "https://github.com/simonw/ttok/releases",
        "Homepage": "https://github.com/simonw/ttok",
        "Issues": "https://github.com/simonw/ttok/issues"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "187b3ea3eefd21c8871f66a54c351f023e0ab5826fd8727c3471d238b96e774f",
                "md5": "c6fdb0317bf715e4a5a7441397603130",
                "sha256": "6a2e214f77c695ee93ea059528636c6db14f3d8bdcd35860bf88ccb94940fcb3"
            },
            "downloads": -1,
            "filename": "ttok-0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c6fdb0317bf715e4a5a7441397603130",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 9226,
            "upload_time": "2024-05-02T23:39:06",
            "upload_time_iso_8601": "2024-05-02T23:39:06.727365Z",
            "url": "https://files.pythonhosted.org/packages/18/7b/3ea3eefd21c8871f66a54c351f023e0ab5826fd8727c3471d238b96e774f/ttok-0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d3d92751bb9b3f53206d5a0dd062cee6ca3a9dfbc30f5c8f43d6684a0c0cb810",
                "md5": "302f4f1cb373298764c59e2291bdefca",
                "sha256": "0474a00a574760db224d24aeafa50e1b9eadf6a574e7a6c1324658bd9b8249b8"
            },
            "downloads": -1,
            "filename": "ttok-0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "302f4f1cb373298764c59e2291bdefca",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 9926,
            "upload_time": "2024-05-02T23:39:08",
            "upload_time_iso_8601": "2024-05-02T23:39:08.127182Z",
            "url": "https://files.pythonhosted.org/packages/d3/d9/2751bb9b3f53206d5a0dd062cee6ca3a9dfbc30f5c8f43d6684a0c0cb810/ttok-0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-02 23:39:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "simonw",
    "github_project": "ttok",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ttok"
}
        
Elapsed time: 2.22948s