code_tokenizers
================
This library is built on top of the awesome
[transformers](https://github.com/huggingface/transformers) and
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter) libraries.
It provides a simple interface to align the tokens produced by a BPE
tokenizer with the AST nodes produced by a tree-sitter parser.
## Install
``` sh
pip install code_tokenizers
```
## How to use
The main interface of `code_tokenizers` is the
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
class. You can use a pretrained BPE tokenizer from the popular
[transformers](https://huggingface.co/docs/transformers/quicktour#autotokenizer)
library, and a tree-sitter parser from the
[tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#python)
library.
To create a
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
that uses the `gpt2` BPE tokenizer and the `python` tree-sitter parser,
you can do:
``` python
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
```
You can specify any pretrained BPE tokenizer from the [Hugging Face
Hub](https://hf.co/models) or a local directory, along with the
language whose AST you want to parse.
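For example, a minimal sketch of pairing a different checkpoint with
another grammar (the `Salesforce/codet5-small` checkpoint and the
`java` grammar here are illustrative assumptions, not tested against
this release):

``` python
from code_tokenizers.core import CodeTokenizer

# Hypothetical combination for illustration: any hub checkpoint or
# local tokenizer directory, paired with a tree-sitter language name.
java_tokenizer = CodeTokenizer.from_pretrained("Salesforce/codet5-small", "java")
```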
Now, we can tokenize some code:
``` python
from pprint import pprint
code = """
def foo():
print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
```
    {'ast_ids': [...],
     'attention_mask': [...],
     'input_ids': [...],
     'is_builtins': [...],
     'is_internal_methods': [...],
     'merged_ast': [...],
     'offset_mapping': [...],
     'parent_ast_ids': [...]}
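Most of these fields appear to be per-token lists parallel to
`input_ids`. As a minimal sketch, assuming `offset_mapping` holds
`(start, end)` character spans into the source string (as with
transformers fast tokenizers), you can recover each token's text
alongside its AST ID:

``` python
# Sketch: pair each token's source span with its AST ID.
# Assumption: offset_mapping entries are (start, end) character
# offsets into `code`, as produced by a fast tokenizer.
for (start, end), ast_id in zip(encoding["offset_mapping"], encoding["ast_ids"]):
    print(repr(code[start:end]), ast_id)
```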
And we can print out the associated AST types:
> **Note**
>
> Here the N/As are the tokens that are not part of the AST, such as
> the spaces and the newline characters. Their AST IDs are set to -1.
``` python
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
```
    N/A
    function_definition def
    function_definition identifier
    parameters (
    N/A
    N/A
    N/A
    N/A
    call identifier
    argument_list (
    argument_list string
    argument_list string
    argument_list string
    argument_list )
    N/A
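The other boolean fields follow the same per-token pattern. For
example, a hedged sketch that lists the tokens flagged as Python
builtins (assuming `is_builtins` is a boolean list parallel to the
fields above):

``` python
# Assumption: is_builtins marks tokens whose text is a Python builtin
# (here, `print`); offset_mapping gives each token's character span.
for (start, end), is_builtin in zip(encoding["offset_mapping"], encoding["is_builtins"]):
    if is_builtin:
        print(code[start:end])
```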