source-parser

Name: source-parser
Version: 1.1.0
Home page: https://github.com/microsoft/source_parser
Summary: Parsers and tools for extracting method/class-level features from source code
Upload time: 2024-03-13 05:32:23
Author: Microsoft
Requires Python: >=3.8
Keywords: tree_sitter, universal-ast, codesearchnet, method-docstring
# Source Parser

This package contains tools for parsing source code into an annotated JSON data structure: it extracts
import statements, global assignments, top-level methods, classes, and class methods and attributes, and
separates each method and class into signature, docstring, body, and other language-specific attributes
for downstream modeling purposes. The package provides a high-performance multiprocessing tool called `repo_parse`
which can clone and parse every file in a repository at a rate of thousands
of repositories per minute. See the subsequent sections for installation, usage, and, at the end, a general
description of the annotated schema.

## Currently supported languages
 * Python
 * Java
 * Javascript/Typescript
 * C#
 * C++
 * Ruby

## Installation

__NOTE__: this tool is only supported on **\*NIX-style operating systems (Linux, macOS, FreeBSD, etc.)**


### PyPI installation

To install the latest release from PyPI, simply invoke

```bash
python -m pip install source-parser
```

## Usage

### Scripting

Simply load the source file contents and hand them to a parser, e.g.

```python
from source_parser.parsers import PythonParser
pp = PythonParser(open('source_parser/crawler.py').read())
print(pp.schema)
```

will print the schema extracted from `source_parser/crawler.py`.
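
Continuing the example, the returned schema is a plain dictionary, so you can drill into
individual fields. The sketch below assumes the file-level keys documented in the
"Data Schema" section further down; the exact attributes vary by language parser.

```python
from source_parser.parsers import PythonParser

with open('source_parser/crawler.py') as f:
    pp = PythonParser(f.read())

# Print each top-level method's name and the first line of its docstring,
# assuming the 'methods', 'name', and 'docstring' keys documented below.
for method in pp.schema['methods']:
    doc_lines = (method['docstring'] or '').splitlines()
    print(method['name'], '->', doc_lines[0] if doc_lines else '<no docstring>')
```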

### Parsing at scale

This tool is really intended to run at scale, over hundreds of thousands
of git repositories.
Two CLI tools are added upon installation:
 - `repo_parse -h`: semantically parses code using `source_parser`
 - `repo_scrape -h`: just grabs all files matching some patterns

For example:

```bash
repo_parse <language> <repo_list.json> <outdir> [--tmpdir <temporary_directory>]
```

where `<language>` is one of the supported languages indicated in the help message,
`<repo_list.json>` is a path to a `.json` file containing a list of dictionaries with at
least a `'url'` key for a `git` repository and optionally a `'license'` key. `<outdir>`
is the directory in which to place the saved results as a `lz4` compressed `jsonlines`
file, and `--tmpdir` is an optional place to save temporary data like cloned
repositories.
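
For reference, a minimal `repo_list.json` could look like the following; the URLs and
license label here are illustrative placeholders:

```json
[
    {"url": "https://github.com/owner/repo-one", "license": "mit"},
    {"url": "https://github.com/owner/repo-two"}
]
```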

_Protip: mount a RAMdisk and hand it to `--tmpdir` to remove the IO bottleneck
and double parsing speeds! Further, you can set `<outdir>` to be in the RAMdisk as well,
so no disk is necessary (if you have enough memory)._

```bash
sudo mount -t tmpfs -o size=<size in Gigabytes>G <name-ramdisk> /path/to/ramdisk
```
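
As a concrete sketch, an 8 GB RAMdisk serving both the temporary directory and the output
directory might be set up like this (size, mount point, and file names are illustrative):

```bash
# Create and mount an 8 GB RAMdisk (illustrative size and mount point)
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8G ramdisk /mnt/ramdisk

# Point both the output directory and --tmpdir at the RAMdisk
repo_parse python repos.json /mnt/ramdisk/out --tmpdir /mnt/ramdisk/tmp
```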


### Reading the data

The default compression algorithm used is `lz4`, chosen for its high speed and reasonable
compression ratio. The JSON dictionaries are highly compressible, so you can generally expect
the uncompressed data to be 2-3x as large; do read the data in streaming fashion rather than
holding it all in memory uncompressed at once.

To this end there is a convenience tool in `source_parser/__init__.py`, importable
via `from source_parser import load_zip_json`, which returns an iterator that
decompresses and yields one file-level schema dictionary at a time.

To use:

```python
from source_parser import load_zip_json
for example in load_zip_json('file_saved_from_repocontext.lz4'):
    process_file_example(example)
```

If you'd like to load it all into memory at once:
```python
from source_parser import load_zip_json
all_data = list(load_zip_json('file_saved_from_repocontext.lz4'))
```

### Data Schema

This is a description of the JSON schema into which `source_parser` will
transform source code files, for use in method and class-level code-natural
language modeling. The data will consist of JSON lines, that is valid JSON
separated by newline characters. Each line of JSON will be the features
extracted from a single source code file. The proposed JSON schema for each
individual file is as follows:

_NOTE: See individual language parsers in `source_parser/parsers` for the language-specific method and class attributes._

```json
{
    'file_name': 'name_of_file.extension',

    'file_hash': 'hash of file for literal deduplication',

    'relative_path': 'repo_top_level/path/to/file/name_of_file.extension',

    'repo_name': 'owner/repo-name',

    'commit-hash': 'hash of the commit being analyzed',

    'license': {
        'label': 'label provided by github API or in json list',
        'files': [
            {
                'relative_path': 'path/to/license/file',
                'file_contents': 'license file contents',
            },
        ],
    },

    'original_string': 'original string of file',

    'file_docstring': 'string containing first docstring for all of file',

    'contexts': [
        'import statement 1',
        'import statement 2',
        'global variable expression 1',
        ...
    ],

    'language_version_details': [
        'e.g. python2 syntax detected', "another language's idiosyncrasies"
    ],

    'methods': [  # list of dictionaries annotating each method
        {
            'original_string': 'verbatim code of whole method',

            'byte_span': (start_byte, end_byte),

            'start_point': (start_line_number, start_column),

            'end_point': (end_line_number, end_column),

            'signature': 'string corresponding to definition, name, arguments of method',

            'name': 'name of method',

            'docstring': 'verbatim docstring corresponding to this method',

            'body': 'verbatim code body',

            'original_string_normed': 'code of whole method with string-literal, numeral normalization',

            'signature_normed': 'signature with string-literals/numerals normalized',

            'body_normed': 'code of body with string-literals/numerals normalized',

            'default_arguments': {'arg1': 'default value 1', ...},

            'syntax_pass': 'True/False whether the method is syntactically correct',

            'attributes': {
                'language_specific_keys': 'language_specific_values',
                'decorators': ['@wrap', '@abstractmethod'],
                ...
            },
            ...
        },
        ...
    ],

    'classes': [
        {
            'original_string': 'verbatim code of class',

            'byte_span': (start_byte, end_byte),

            'start_point': (start_line_number, start_column),

            'end_point': (end_line_number, end_column),

            'name': 'class name',

            'definition': 'class definition statement',

            'class_docstring': 'docstring corresponding to top-level class definition',

            'attributes': {  # language-specific keys and values, e.g.
                'expression_statements': [
                    {
                        'expression': 'attribute = 1',
                        'comment': 'associated comment'
                    },
                    ...
                ],
                'classes': [  # classes defined within classes
                    {
                        # same structure as top-level classes
                    }
                ],
                ...
            },

            'methods': [
                # list of class methods of the same form as top-level methods
                ...
            ]
        },
        ...
    ]
}
```
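
As a sketch of how this schema feeds method-level code/natural-language modeling, the
snippet below streams a parsed corpus and collects `(signature, docstring)` pairs from
both top-level and class methods. It assumes the keys documented above; `corpus.lz4` is
a placeholder filename:

```python
from source_parser import load_zip_json

# Gather (signature, docstring) pairs from top-level and class methods.
# Assumes the file-level schema documented above; 'corpus.lz4' is a placeholder.
pairs = []
for file_schema in load_zip_json('corpus.lz4'):
    methods = list(file_schema['methods'])
    for cls in file_schema['classes']:
        methods.extend(cls['methods'])
    for method in methods:
        if method['docstring']:  # keep only documented methods
            pairs.append((method['signature'], method['docstring']))
```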

## Contributing

We welcome contributions. Please follow [this guideline](CONTRIBUTING.md).

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
trademarks or logos is subject to and must follow 
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.


            
