# PGN Tokenizer
This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).
It uses Hugging Face's [`tokenizers`](https://huggingface.co/docs/tokenizers/) library to train the tokenizer and Hugging Face's [`transformers`](https://huggingface.co/docs/transformers/) library to load the pretrained tokenizer model for faster tokenization.
**Note**: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.
## Tokenizer Comparison
Traditional, language-focused BPE tokenizers are poorly suited to PGN strings because they tend to break individual moves apart.
For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, ` N`, `f`, `6` or `1`, `.e`, `4`, ` `, ` N`, `f`, `6` depending on the tokenizer's vocabulary, but the specialized PGN tokenizer tokenizes it as `1.`, `e4`, ` Nf6`.
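You can check this yourself with `tiktoken` (assuming you have it installed; the exact splits depend on the vocabulary in use):

```python
# Sketch: inspect how a general-purpose BPE vocabulary splits PGN text.
# Requires the `tiktoken` package; output will vary by vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("1.e4 Nf6")

# Decode each id individually to reveal the token boundaries.
print([enc.decode([i]) for i in ids])
```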
### Visualization
Here is a visualization of this specialized PGN tokenizer's vocabulary compared to the BPE vocabularies of `cl100k_base` (used by the `gpt-3.5-turbo` and `gpt-4` models' tokenizer) and `o200k_base` (used by the `gpt-4o` model's tokenizer):
#### PGN Tokenizer

**Note**: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of `4096`.
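For reference, here is a minimal sketch of how a BPE tokenizer like this can be trained with the `tokenizers` library; the corpus path, pre-tokenizer, and special tokens below are illustrative assumptions, not this project's actual training configuration:

```python
# Hypothetical training sketch using Hugging Face `tokenizers`; the corpus
# file, pre-tokenizer, and special tokens are assumptions for illustration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# Byte-level pre-tokenization keeps leading spaces attached to tokens,
# consistent with tokens like ` Nf6` above (assumption).
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<|endoftext|>"])
tokenizer.train(["games.pgn"], trainer)  # hypothetical PGN corpus file
tokenizer.save("pgn-tokenizer.json")
```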
#### GPT-3.5-turbo and GPT-4 Tokenizers

#### GPT-4o Tokenizer

These were all generated with a function adapted from an [educational script in the `tiktoken` repository](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186).
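The core idea is simple enough to sketch: print each token with a cycling background color so the boundaries become visible. This is a simplified stand-in, not the notebook's actual function:

```python
# Simplified sketch of token-boundary visualization using ANSI colors;
# the palette and formatting are arbitrary choices, not tiktoken's.
def show_tokens(token_strings: list[str]) -> None:
    palette = [167, 179, 185, 77, 80, 68, 134]  # 256-color background ids
    for i, token in enumerate(token_strings):
        print(f"\x1b[48;5;{palette[i % len(palette)]}m{token}\x1b[0m", end="")
    print()

show_tokens(["1.", "e4", " Nf6"])  # each token gets its own background color
```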
## Installation
You can install it with your package manager of choice:
### uv
```bash
uv add pgn-tokenizer
```
### pip
```bash
pip install pgn-tokenizer
```
## Usage
It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` instance from the `transformers` library via the `.tokenizer` property.
```python
from pgn_tokenizer import PGNTokenizer
# Initialize the tokenizer
tokenizer = PGNTokenizer()
# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)
# Get the vocabulary from the underlying tokenizer
vocab = tokenizer.tokenizer.get_vocab()
```
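Because `.tokenizer` is a standard `PreTrainedTokenizerFast`, the usual `transformers` methods are available too; for example, `convert_ids_to_tokens` shows the token strings behind the ids (assuming the wrapper exposes the instance as described above):

```python
# Inspect the token strings behind the ids via the standard transformers API.
ids = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
print(tokenizer.tokenizer.convert_ids_to_tokens(ids))
```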
## Acknowledgements
- [@karpathy](https://github.com/karpathy) for the [Let's build the GPT Tokenizer tutorial](https://youtu.be/zduSFxRajkE)
- [Hugging Face](https://huggingface.co/) for the [`tokenizers`](https://huggingface.co/docs/tokenizers/) and [`transformers`](https://huggingface.co/docs/transformers/) libraries.
- Kaggle user [MilesH14](https://www.kaggle.com/milesh14), whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this [research documentation](https://chess-research-project.readthedocs.io/en/latest/)