code_tokenizers
================
This library is built on top of the awesome
[transformers](https://github.com/huggingface/transformers) and
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter) libraries.
It provides a simple interface to align the tokens produced by a BPE
tokenizer with the AST nodes produced by a tree-sitter parser.
## Install
``` sh
pip install code_tokenizers
```
## How to use
The main interface of `code_tokenizers` is the
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
class. You can use a pretrained BPE tokenizer from the popular
[transformers](https://huggingface.co/docs/transformers/quicktour#autotokenizer)
library, and a tree-sitter parser from the
[tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#python)
library.
To create a
[`CodeTokenizer`](https://ncoop57.github.io/code_tokenizers/core.html#codetokenizer)
that uses the `gpt2` BPE tokenizer and the `python` tree-sitter parser,
you can do:
``` python
from code_tokenizers.core import CodeTokenizer
py_tokenizer = CodeTokenizer.from_pretrained("gpt2", "python")
```
You can specify any pretrained BPE tokenizer from the [Hugging Face
Hub](https://hf.co/models) or a local directory, along with the
language whose AST you want to parse.
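For example, a minimal sketch of pairing a different checkpoint with
another grammar (the `Salesforce/codet5-small` checkpoint and the
`java` grammar here are illustrative assumptions, not tested against
this release):

``` python
from code_tokenizers.core import CodeTokenizer

# Hypothetical combination for illustration: any hub checkpoint or
# local tokenizer directory, paired with a tree-sitter language name.
java_tokenizer = CodeTokenizer.from_pretrained("Salesforce/codet5-small", "java")
```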
Now, we can tokenize some code:
``` python
from pprint import pprint
code = """
def foo():
print("Hello world!")
"""
encoding = py_tokenizer(code)
pprint(encoding, depth=1)
```
    {'ast_ids': [...],
     'attention_mask': [...],
     'input_ids': [...],
     'is_builtins': [...],
     'is_internal_methods': [...],
     'merged_ast': [...],
     'offset_mapping': [...],
     'parent_ast_ids': [...]}
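Most of these fields appear to be per-token lists parallel to
`input_ids`. As a minimal sketch, assuming `offset_mapping` holds
`(start, end)` character spans into the source string (as with
transformers fast tokenizers), you can recover each token's text
alongside its AST ID:

``` python
# Sketch: pair each token's source span with its AST ID.
# Assumption: offset_mapping entries are (start, end) character
# offsets into `code`, as produced by a fast tokenizer.
for (start, end), ast_id in zip(encoding["offset_mapping"], encoding["ast_ids"]):
    print(repr(code[start:end]), ast_id)
```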
And we can print out the associated AST types:
> **Note**
>
> Here the N/As are the tokens that are not part of the AST, such as
> the spaces and the newline characters. Their AST IDs are set to -1.
``` python
for ast_id, parent_ast_id in zip(encoding["ast_ids"], encoding["parent_ast_ids"]):
    if ast_id != -1:
        print(py_tokenizer.node_types[parent_ast_id], py_tokenizer.node_types[ast_id])
    else:
        print("N/A")
```
    N/A
    function_definition def
    function_definition identifier
    parameters (
    N/A
    N/A
    N/A
    N/A
    call identifier
    argument_list (
    argument_list string
    argument_list string
    argument_list string
    argument_list )
    N/A
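The other boolean fields follow the same per-token pattern. For
example, a hedged sketch that lists the tokens flagged as Python
builtins (assuming `is_builtins` is a boolean list parallel to the
fields above):

``` python
# Assumption: is_builtins marks tokens whose text is a Python builtin
# (here, `print`); offset_mapping gives each token's character span.
for (start, end), is_builtin in zip(encoding["offset_mapping"], encoding["is_builtins"]):
    if is_builtin:
        print(code[start:end])
```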