count-tokens

Name: count-tokens
Version: 0.7.0
Home page: https://github.com/izikeros/count_tokens
Summary: Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
Upload time: 2023-09-26 11:16:08
Author: Krystian Safjan
Requires Python: >=3.9,<4.0
License: MIT
Keywords: count, tokens, tiktoken, openai, tokenizer
# Count tokens

![img](https://img.shields.io/pypi/v/count-tokens.svg)
![](https://img.shields.io/pypi/pyversions/count-tokens.svg)
![](https://img.shields.io/pypi/dm/count-tokens.svg)
<a href="https://codeclimate.com/github/izikeros/count_tokens/maintainability"><img src="https://api.codeclimate.com/v1/badges/37fd0435fff274b6c9b5/maintainability" /></a>

A simple tool with one purpose: counting tokens in a text file.


## Requirements

This package uses the [tiktoken](https://github.com/openai/tiktoken) library for tokenization.


## Installation
For command-line usage, install the package in an isolated environment with pipx:

```sh
$ pipx install count-tokens
```

or install it in your current environment with pip: `pip install count-tokens`.


## Usage
Open a terminal and run:

```shell
$ count-tokens document.txt
```

You should see something like this:

```shell
File: document.txt
Encoding: cl100k_base
Number of tokens: 67
```

If you want to see just the token count, run:

```shell
$ count-tokens document.txt --quiet
```
and the output will be:

```shell
67
```

NOTE: `tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                        |
|-------------------------|------------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`   |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                          |
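
If you are unsure which encoding a given model uses, tiktoken itself can resolve it from the model name. A small illustrative snippet (this calls tiktoken's public API directly, not count-tokens):

```python
import tiktoken

# Resolve the encoding for a given OpenAI model name.
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name)  # -> "cl100k_base"
```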

To use count-tokens with an encoding other than the default `cl100k_base`, use the `-e` (or `--encoding`) argument:

```shell
$ count-tokens document.txt -e r50k_base
```

## Approximate number of tokens
If you need the result faster and don't need the exact number of tokens, you can use the `--approx` parameter with `w` to get an approximation based on the number of words, or `c` for an approximation based on the number of characters.

```shell
$ count-tokens document.txt --approx w
```

The approximation assumes 4/3 (about 1.33) tokens per word and 4 characters per token.
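
For illustration, here is a minimal Python sketch of the two approximations (an assumed reimplementation of the formulas above, not the package's internals):

```python
# Approximate the token count without running the tokenizer.
# Illustrative sketch only; not count-tokens internals.
with open("document.txt", encoding="utf-8") as f:
    text = f.read()

approx_by_words = round(len(text.split()) * 4 / 3)  # --approx w: 4/3 tokens per word
approx_by_chars = round(len(text) / 4)              # --approx c: 4 characters per token
print(approx_by_words, approx_by_chars)
```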


## Programmatic usage

```python
from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")
```

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")
```

For both functions, you can use the `encoding` parameter to specify the encoding used by the model:

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```
The default value for `encoding` is `cl100k_base`.
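
Under the hood the package delegates tokenization to tiktoken. A minimal sketch of the presumed equivalent tiktoken call (the actual implementation may differ):

```python
import tiktoken

def count_tokens(text: str, encoding: str = "cl100k_base") -> int:
    """Presumed equivalent of count_tokens_in_string (illustrative sketch)."""
    enc = tiktoken.get_encoding(encoding)  # look up the tokenizer by encoding name
    return len(enc.encode(text))           # encode to token IDs and count them

print(count_tokens("This is a string."))
```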

## Related Projects
- [tiktoken](https://github.com/openai/tiktoken) - tokenization library used by this package

## Credits

Thanks to the authors of the tiktoken library for open-sourcing their work.

## License

[MIT](https://izikeros.mit-license.org/) © [Krystian Safjan](https://safjan.com).

            
