tocount

Name	tocount
Version	0.1
home_page	https://github.com/openscilab/tocount
Summary	ToCount: Lightweight Token Estimator
upload_time	2025-08-30 16:21:30
maintainer	None
docs_url	None
author	ToCount Development Team
requires_python	>=3.7
license	MIT
keywords	token tokenizer estimation llm ml nlp
requirements	No requirements were recorded.
Travis-CI	No Travis.

            
<div align="center">
    <h1>ToCount: Lightweight Token Estimator</h1>
    <br/>
    <a href="https://badge.fury.io/py/tocount"><img src="https://badge.fury.io/py/tocount.svg" alt="PyPI version"></a>
    <a href="https://codecov.io/gh/openscilab/tocount"><img src="https://codecov.io/gh/openscilab/tocount/branch/dev/graph/badge.svg?token=T9T0EPB3V2"></a>
    <a href="https://www.python.org/"><img src="https://img.shields.io/badge/built%20with-Python3-green.svg" alt="built with Python3"></a>
    <a href="https://github.com/openscilab/tocount"><img alt="GitHub repo size" src="https://img.shields.io/github/repo-size/openscilab/tocount"></a>
</div>

----------


## Overview
<p align="justify">
ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
</p>

<table>
    <tr>
        <td align="center">PyPI Counter</td>
        <td align="center">
            <a href="https://pepy.tech/projects/tocount">
                <img src="https://static.pepy.tech/badge/tocount">
            </a>
        </td>
    </tr>
    <tr>
        <td align="center">GitHub Stars</td>
        <td align="center">
            <a href="https://github.com/openscilab/tocount">
                <img src="https://img.shields.io/github/stars/openscilab/tocount.svg?style=social&label=Stars">
            </a>
        </td>
    </tr>
</table>
<table>
    <tr> 
        <td align="center">Branch</td>
        <td align="center">main</td>
        <td align="center">dev</td>
    </tr>
    <tr>
        <td align="center">CI</td>
        <td align="center">
            <img src="https://github.com/openscilab/tocount/actions/workflows/test.yml/badge.svg?branch=main">
        </td>
        <td align="center">
            <img src="https://github.com/openscilab/tocount/actions/workflows/test.yml/badge.svg?branch=dev">
        </td>
    </tr>
</table>


## Installation

### PyPI
- Check [Python Packaging User Guide](https://packaging.python.org/installing/)
- Run `pip install tocount==0.1`

### Source code
- Download [Version 0.1](https://github.com/openscilab/tocount/archive/v0.1.zip) or [Latest Source](https://github.com/openscilab/tocount/archive/dev.zip) and extract the archive
- Run `pip install .` from the extracted directory
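
After either install route, a quick check can confirm which version ended up in the environment. The snippet below is a minimal sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical and not part of ToCount.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("tocount")
    if version is None:
        print("tocount is not installed in this environment")
    else:
        print(f"tocount {version} is installed")
```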

## Models

| Model Name                 | Type        |   MAE   |     MSE     |   R²   |
|----------------------------|-------------|---------|-------------|--------|
| `RULE_BASED.UNIVERSAL`     | Rule-Based  | 106.70  | 381,647.81  | 0.8175 |
| `RULE_BASED.GPT_4`         | Rule-Based  | 152.34  | 571,795.89  | 0.7266 |
| `RULE_BASED.GPT_3_5`       | Rule-Based  | 161.93  | 652,923.59  | 0.6878 |

ℹ️ The training and test data are drawn from LMSYS-Chat-1M [1] and WildChat [2].
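
To help read the table, here is a small sketch of how the three metrics are conventionally computed. It assumes the standard definitions of MAE, MSE, and R²; this README does not state how the reported values were actually calculated. The function name `regression_metrics` and the sample numbers are illustrative only, not ToCount code or data.

```python
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, and R² for paired true and estimated token counts."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    mse = ss_res / n                           # mean squared error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    # R² is undefined when every true value is identical (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "r2": r2}


if __name__ == "__main__":
    # Made-up token counts, for illustration only.
    print(regression_metrics([10, 20, 30], [12, 18, 33]))
```

Lower MAE and MSE mean the estimates sit closer to the true token counts, while an R² nearer to 1 means the estimator explains more of the variance in those counts.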

## Usage

```pycon
>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
```
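
For token budgeting, the estimate can be wrapped in a simple check. The following is a sketch rather than part of the library: `fits_budget` and `naive_estimate` are hypothetical helpers, and `naive_estimate` (about one token per four characters) is only a stand-in so the example still runs when ToCount is not installed. The only ToCount names used are `estimate_text_tokens` and `TextEstimator.RULE_BASED.UNIVERSAL` from the example above.

```python
from typing import Callable


def fits_budget(prompt: str, budget: int, count_tokens: Callable[[str], int]) -> bool:
    """Return True when the estimated token count of `prompt` is within `budget`."""
    return count_tokens(prompt) <= budget


def naive_estimate(text: str) -> int:
    # Hypothetical fallback, NOT ToCount's rule: roughly one token per 4 characters.
    return max(1, len(text) // 4)


try:
    from tocount import estimate_text_tokens, TextEstimator

    def count_tokens(text: str) -> int:
        # Same call as in the usage example above.
        return estimate_text_tokens(text, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
except ImportError:
    # ToCount is not installed; fall back to the stand-in estimator.
    count_tokens = naive_estimate


if __name__ == "__main__":
    prompt = "How are you?"
    print(fits_budget(prompt, budget=10, count_tokens=count_tokens))
```

Passing the estimator in as a callable keeps the budgeting logic independent of which model is used, so `RULE_BASED.GPT_4` or `RULE_BASED.GPT_3_5` from the Models table could be substituted in `count_tokens` without touching `fits_budget`.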

## Issues & bug reports

Just file an issue and describe the problem; we'll check it as soon as possible. Alternatively, send an email to [tocount@openscilab.com](mailto:tocount@openscilab.com "tocount@openscilab.com").

- Please complete the issue template

## References

<blockquote>1- Zheng, Lianmin, et al. "LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.</blockquote>

<blockquote>2- Zhao, Wenting, et al. "WildChat: 1M ChatGPT interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.</blockquote>

## Show your support


### Star this repo

Give a ⭐️ if this project helped you!

### Donate to our project
If you like our project, and we hope that you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money just so we can continue doing what we do ;-)

<a href="https://openscilab.com/#donation" target="_blank"><img src="https://github.com/openscilab/tocount/raw/main/otherfiles/donation.png" width="270" alt="ToCount Donation"></a>

# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [0.1] - 2025-08-30
### Added
- `RULE_BASED.UNIVERSAL` model
- `RULE_BASED.GPT_4` model
- `RULE_BASED.GPT_3_5` model


[Unreleased]: https://github.com/openscilab/tocount/compare/v0.1...dev
[0.1]: https://github.com/openscilab/tocount/compare/8385d46...v0.1

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/openscilab/tocount",
    "name": "tocount",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "token tokenizer estimation llm ml nlp",
    "author": "ToCount Development Team",
    "author_email": "tocount@openscilab.com",
    "download_url": "https://files.pythonhosted.org/packages/e0/b9/68d372d826c52be0ced7be627cd9665d83e52dbf0d510ed38ca3cac7247d/tocount-0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "ToCount: Lightweight Token Estimator",
    "version": "0.1",
    "project_urls": {
        "Download": "https://github.com/openscilab/tocount/tarball/v0.1",
        "Homepage": "https://github.com/openscilab/tocount",
        "Source": "https://github.com/openscilab/tocount"
    },
    "split_keywords": [
        "token",
        "tokenizer",
        "estimation",
        "llm",
        "ml",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bca452ab23dd023a1793959caa9dc544b152b4d5b6a693c19862dc87745c0eb0",
                "md5": "d5c76b3cda4fc90eb0c175a49da5b508",
                "sha256": "9246c9f8c85c4af5ec6c437c0bbd0dfa18333a34909721db21220f1149829fab"
            },
            "downloads": -1,
            "filename": "tocount-0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d5c76b3cda4fc90eb0c175a49da5b508",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 11107,
            "upload_time": "2025-08-30T16:21:32",
            "upload_time_iso_8601": "2025-08-30T16:21:32.203179Z",
            "url": "https://files.pythonhosted.org/packages/bc/a4/52ab23dd023a1793959caa9dc544b152b4d5b6a693c19862dc87745c0eb0/tocount-0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e0b968d372d826c52be0ced7be627cd9665d83e52dbf0d510ed38ca3cac7247d",
                "md5": "d41b60d43691584e2fc990e77e10d48e",
                "sha256": "7ac065b193346696c89906fafe5e3257c5de6a97cac0342cceed06eaa7886f02"
            },
            "downloads": -1,
            "filename": "tocount-0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "d41b60d43691584e2fc990e77e10d48e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 13736,
            "upload_time": "2025-08-30T16:21:30",
            "upload_time_iso_8601": "2025-08-30T16:21:30.844103Z",
            "url": "https://files.pythonhosted.org/packages/e0/b9/68d372d826c52be0ced7be627cd9665d83e52dbf0d510ed38ca3cac7247d/tocount-0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-30 16:21:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "openscilab",
    "github_project": "tocount",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [],
    "lcname": "tocount"
}
