# PGN Tokenizer
This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).
It uses Hugging Face's [`tokenizers`](https://huggingface.co/docs/tokenizers/) library to train the tokenizer and Hugging Face's [`transformers`](https://huggingface.co/docs/transformers/) library to load the pretrained tokenizer model for faster tokenization.
**Note**: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.
## Tokenizer Comparison
Traditional, language-focused BPE tokenizers are poorly suited to PGN strings because they tend to break individual moves apart.
For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, ` N`, `f`, `6` or `1`, `.e`, `4`, ` `, ` N`, `f`, `6` depending on the tokenizer's vocabulary, but the specialized PGN tokenizer tokenizes it as `1.`, `e4`, ` Nf6`.
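You can check this yourself with `tiktoken` (assuming you have it installed; the exact splits depend on the vocabulary in use):

```python
# Sketch: inspect how a general-purpose BPE vocabulary splits PGN text.
# Requires the `tiktoken` package; output will vary by vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("1.e4 Nf6")

# Decode each id individually to reveal the token boundaries.
print([enc.decode([i]) for i in ids])
```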
### Visualization
Here is a visualization of this specialized PGN tokenizer's vocabulary compared to the BPE vocabularies of `cl100k_base` (used by the `gpt-3.5-turbo` and `gpt-4` models' tokenizer) and `o200k_base` (used by the `gpt-4o` model's tokenizer):
#### PGN Tokenizer

**Note**: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of `4096`.
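For reference, here is a minimal sketch of how a BPE tokenizer like this can be trained with the `tokenizers` library; the corpus path, pre-tokenizer, and special tokens below are illustrative assumptions, not this project's actual training configuration:

```python
# Hypothetical training sketch using Hugging Face `tokenizers`; the corpus
# file, pre-tokenizer, and special tokens are assumptions for illustration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# Byte-level pre-tokenization keeps leading spaces attached to tokens,
# consistent with tokens like ` Nf6` above (assumption).
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<|endoftext|>"])
tokenizer.train(["games.pgn"], trainer)  # hypothetical PGN corpus file
tokenizer.save("pgn-tokenizer.json")
```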
#### GPT-3.5-turbo and GPT-4 Tokenizers

#### GPT-4o Tokenizer

These were all generated with a function adapted from an [educational script in the `tiktoken` repository](https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186).
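The core idea is simple enough to sketch: print each token with a cycling background color so the boundaries become visible. This is a simplified stand-in, not the notebook's actual function:

```python
# Simplified sketch of token-boundary visualization using ANSI colors;
# the palette and formatting are arbitrary choices, not tiktoken's.
def show_tokens(token_strings: list[str]) -> None:
    palette = [167, 179, 185, 77, 80, 68, 134]  # 256-color background ids
    for i, token in enumerate(token_strings):
        print(f"\x1b[48;5;{palette[i % len(palette)]}m{token}\x1b[0m", end="")
    print()

show_tokens(["1.", "e4", " Nf6"])  # each token gets its own background color
```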
## Installation
You can install it with your package manager of choice:
### uv
```bash
uv add pgn-tokenizer
```
### pip
```bash
pip install pgn-tokenizer
```
## Usage
It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` instance from the `transformers` library via the `.tokenizer` property.
```python
from pgn_tokenizer import PGNTokenizer
# Initialize the tokenizer
tokenizer = PGNTokenizer()
# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)
# Get the vocabulary from the underlying tokenizer
vocab = tokenizer.tokenizer.get_vocab()
```
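Because `.tokenizer` is a standard `PreTrainedTokenizerFast`, the usual `transformers` methods are available too; for example, `convert_ids_to_tokens` shows the token strings behind the ids (assuming the wrapper exposes the instance as described above):

```python
# Inspect the token strings behind the ids via the standard transformers API.
ids = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
print(tokenizer.tokenizer.convert_ids_to_tokens(ids))
```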
## Acknowledgements
- [@karpathy](https://github.com/karpathy) for the [Let's build the GPT Tokenizer tutorial](https://youtu.be/zduSFxRajkE)
- [Hugging Face](https://huggingface.co/) for the [`tokenizers`](https://huggingface.co/docs/tokenizers/) and [`transformers`](https://huggingface.co/docs/transformers/) libraries.
- Kaggle user [MilesH14](https://www.kaggle.com/milesh14), whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this [research documentation](https://chess-research-project.readthedocs.io/en/latest/)