# Count tokens
![img](https://img.shields.io/pypi/v/count-tokens.svg)
![](https://img.shields.io/pypi/pyversions/count-tokens.svg)
![](https://img.shields.io/pypi/dm/count-tokens.svg)
<a href="https://codeclimate.com/github/izikeros/count_tokens/maintainability"><img src="https://api.codeclimate.com/v1/badges/37fd0435fff274b6c9b5/maintainability" /></a>
A simple tool with one purpose: counting tokens in a text file.
## Table of Contents
- [Count tokens](#count-tokens)
  - [Table of Contents](#table-of-contents)
  - [Requirements](#requirements)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Approximate number of tokens](#approximate-number-of-tokens)
  - [Programmatic usage](#programmatic-usage)
  - [Related Projects](#related-projects)
  - [Credits](#credits)
  - [License](#license)
## Requirements
This package uses the [tiktoken](https://github.com/openai/tiktoken) library for tokenization.
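Counting amounts to encoding the text with `tiktoken` and taking the length of the resulting token list. A minimal sketch of the idea, using `tiktoken` directly (the example string is arbitrary):

```python
import tiktoken

# Load the cl100k_base encoding (this tool's default) and tokenize a string.
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Count tokens in a text file.")

# The token count is simply the length of the token list.
print(len(tokens))
```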
## Installation
For command-line usage, install the package in an isolated environment with pipx:
```sh
pipx install count-tokens
```
or install it in your current environment with pip:
```sh
pip install count-tokens
```
## Usage
Open a terminal and run:
```sh
count-tokens document.txt
```
You should see something like this:
```sh
File: document.txt
Encoding: cl100k_base
Number of tokens: 67
```
If you want to see only the token count, run:
```sh
count-tokens document.txt --quiet
```
and the output will be:
```sh
67
```
To use `count-tokens` with an encoding other than the default `cl100k_base`, pass the `-e` or `--encoding` argument:
```sh
count-tokens document.txt -e r50k_base
```
NOTE: `tiktoken` supports the following encodings used by OpenAI models:
| Encoding name | OpenAI models |
|-------------------------|-----------------------------------------------------|
| `o200k_base` | `gpt-4o`, `gpt-4o-mini` |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |
(source: [OpenAI Cookbook](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken))
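If you know the model name rather than the encoding, `tiktoken` can resolve it for you. A short sketch (assuming the model names are in `tiktoken`'s built-in model-to-encoding mapping):

```python
import tiktoken

# Resolve the encoding that tiktoken associates with a model name.
print(tiktoken.encoding_for_model("gpt-4o").name)         # o200k_base
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # cl100k_base
```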
## Approximate number of tokens
If you need results faster and don't need the exact number of tokens, use the `--approx` parameter with `w` for an approximation based on the number of words, or `c` for one based on the number of characters.
```shell
count-tokens document.txt --approx w
```
The approximation assumes 4/3 (1 and 1/3) tokens per word and 4 characters per token.
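For illustration, here is a minimal sketch of that arithmetic; the helper names are hypothetical and not part of this package's API:

```python
# Hypothetical helpers illustrating the approximation rules above.
def approx_tokens_from_words(n_words: int) -> int:
    # Assumes 4/3 tokens per word.
    return round(n_words * 4 / 3)

def approx_tokens_from_chars(n_chars: int) -> int:
    # Assumes 4 characters per token.
    return round(n_chars / 4)

with open("document.txt") as f:
    text = f.read()

print(approx_tokens_from_words(len(text.split())))  # roughly what --approx w reports
print(approx_tokens_from_chars(len(text)))          # roughly what --approx c reports
```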
## Programmatic usage
```python
from count_tokens.count import count_tokens_in_file
num_tokens = count_tokens_in_file("document.txt")
```
```python
from count_tokens.count import count_tokens_in_string
num_tokens = count_tokens_in_string("This is a string.")
```
For both functions, you can use the `encoding` parameter to specify the encoding used by the model:
```python
from count_tokens.count import count_tokens_in_string
num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```
The default value for `encoding` is `cl100k_base`.
## Related Projects
- [tiktoken](https://github.com/openai/tiktoken) - tokenization library used by this package
- [ttok](https://github.com/simonw/ttok) - count and truncate text based on tokens
## Credits
Thanks to the authors of the [tiktoken](https://github.com/openai/tiktoken) library for open-sourcing their work.
## License
[MIT](https://izikeros.mit-license.org/) © [Krystian Safjan](https://safjan.com).