count-tokens

Name: count-tokens
Version: 0.7.0
Home page: https://github.com/izikeros/count_tokens
Summary: Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
Upload time: 2023-09-26 11:16:08
Author: Krystian Safjan
Requires Python: >=3.9,<4.0
License: MIT
Keywords: count, tokens, tiktoken, openai, tokenizer
# Count tokens

![img](https://img.shields.io/pypi/v/count-tokens.svg)
![](https://img.shields.io/pypi/pyversions/count-tokens.svg)
![](https://img.shields.io/pypi/dm/count-tokens.svg)
<a href="https://codeclimate.com/github/izikeros/count_tokens/maintainability"><img src="https://api.codeclimate.com/v1/badges/37fd0435fff274b6c9b5/maintainability" /></a>

A simple tool with one purpose: counting the tokens in a text file.


## Requirements

This package uses the [tiktoken](https://github.com/openai/tiktoken) library for tokenization.


## Installation
For command-line usage, install the package in an isolated environment with pipx:

```sh
$ pipx install count-tokens
```

or install it in your current environment with pip.


## Usage
Open a terminal and run:

```shell
$ count-tokens document.txt
```

You should see something like this:

```shell
File: document.txt
Encoding: cl100k_base
Number of tokens: 67
```

If you want to see just the token count, run:

```shell
$ count-tokens document.txt --quiet
```
and the output will be:

```shell
67
```

NOTE: `tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                        |
|-------------------------|------------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`   |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                          |
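
The table above can be mirrored as a small lookup in Python. This is an illustrative sketch only; tiktoken's own `tiktoken.encoding_for_model()` is the authoritative mapping:

```python
# Illustrative model-to-encoding lookup mirroring the table above.
# Not exhaustive -- tiktoken.encoding_for_model() is the real source of truth.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_for(model: str, default: str = "cl100k_base") -> str:
    """Return the encoding name for a model, falling back to the default."""
    return MODEL_TO_ENCODING.get(model, default)
```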

To use count-tokens with an encoding other than the default `cl100k_base`, pass the additional argument `-e` or `--encoding`:

```shell
$ count-tokens document.txt -e r50k_base
```

## Approximate number of tokens
If you need results a bit faster and do not need the exact number of tokens, use the `--approx` parameter with `w` for an approximation based on the number of words, or `c` for an approximation based on the number of characters.

```shell
$ count-tokens document.txt --approx w
```

The approximation assumes 4/3 (1 and 1/3) tokens per word and 4 characters per token.
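
The two approximation modes can be sketched with those ratios directly. This is a rough illustration of the formulas, not the package's actual implementation:

```python
def approx_tokens_by_words(text: str) -> int:
    """Approximate token count assuming ~4/3 tokens per word."""
    return round(len(text.split()) * 4 / 3)


def approx_tokens_by_chars(text: str) -> int:
    """Approximate token count assuming ~4 characters per token."""
    return round(len(text) / 4)
```

Both are simple string scans with no tokenizer involved, which is why `--approx` is faster than an exact count.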


## Programmatic usage

```python
from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")
```

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")
```

For both functions, you can use the `encoding` parameter to specify the encoding used by the model:

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```
The default value for `encoding` is `cl100k_base`.

## Related Projects
- [tiktoken](https://github.com/openai/tiktoken) - tokenization library used by this package

## Credits

Thanks to the authors of the tiktoken library for open sourcing their work.

## License

[MIT](https://izikeros.mit-license.org/) © [Krystian Safjan](https://safjan.com).

            
