hityper

Name	hityper
Version	1.0.3
home_page	https://github.com/JohnnyPeng18/HiTyper
Summary	HiTyper: A hybrid type inference framework for Python
upload_time	2023-07-18 14:27:45
author	Yun Peng
keywords	python type inference static analysis
requirements	No requirements were recorded.

# HiTyper
![](https://img.shields.io/badge/Version-1.0-blue)



This is the tool released in the ICSE 2022 paper ["Static Inference Meets Deep Learning: A Hybrid Type Inference Approach for Python"](https://arxiv.org/abs/2105.03595).

## Updates

**8 Aug, 2022:** 

We added a new command, `hityper preprocess`, that transforms the JSON files in the ManyTypes4Py dataset into the `groundtruth.json` and `detailed_groundtruth.json` files that `hityper eval` needs.

We also added a new option, `-g`, to `hityper findusertype`. It collects the `usertypes.json` file that `hityper eval` needs, based on the ground truth file `groundtruth.json`.

## Workflow

HiTyper is a hybrid type inference tool built upon the Type Dependency Graph (TDG). Its typical workflow is as follows:

![](https://github.com/JohnnyPeng18/HiTyper/blob/main/imgs/workflow.png)

For more details, please refer to the [paper](https://arxiv.org/abs/2105.03595).

## Methodology

HiTyper's methodology rests on two observations:

1) Static inference is accurate but suffers from a coverage problem due to Python's dynamic features.

2) Deep learning models are feature-agnostic, but they can hardly maintain type correctness and are unable to predict unseen user-defined types.

Combining static inference with deep learning lets each complement the other, improving coverage while maintaining accuracy.
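The division of labor described above can be sketched in a few lines. This is only a toy illustration, not HiTyper's implementation: `hybrid_infer`, its parameters, and the `is_consistent` validator are hypothetical names introduced here. It assumes the static result is trusted when available and that a deep learning recommendation is accepted only if it passes a correctness check.

```python
from typing import Callable, Optional, Sequence


def hybrid_infer(
    static_type: Optional[str],
    recommendations: Sequence[str],
    is_consistent: Callable[[str], bool],
) -> Optional[str]:
    """Toy combination of static inference and DL recommendations (hypothetical)."""
    # Static inference is accurate, so its result wins whenever it exists.
    if static_type is not None:
        return static_type
    # Otherwise fall back to the ranked recommendations, keeping only
    # a candidate that the (hypothetical) validator accepts.
    for candidate in recommendations:
        if is_consistent(candidate):
            return candidate
    # Neither part produced a usable type.
    return None


# Hypothetical validator: accept only type names known to the project.
known_types = {"int", "str", "list"}
print(hybrid_infer(None, ["Foo", "str"], lambda t: t in known_types))
```

Returning `None` when both parts fail mirrors the idea that a result is produced only when it can be justified.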

## Install

1. Install HiTyper from source

To use HiTyper on your own computer, you can build it from source. If you need to modify HiTyper's source code, use this method and re-run `pip install .` after each modification.

```sh
git clone https://github.com/JohnnyPeng18/HiTyper.git && cd HiTyper
pip install .
```

2. Install HiTyper using `pip`

You can install the latest version of HiTyper by using the following command:

```sh
pip install hityper
```

**Requirements:**

- Python>=3.9
- Linux

HiTyper must run under Python >= 3.9 because Python 3.9 introduced many new AST nodes. However, HiTyper can analyze most files written for Python 3, since Python's AST is backward compatible.

We recommend using `Anaconda` to create a clean Python 3.9 environment, which avoids most dependency conflicts:

```sh
conda create -n hityper python=3.9
```
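Before installing, it can help to confirm that the active interpreter satisfies the two requirements listed above. The following is a minimal sketch using only the standard library; the helper names `meets_python_requirement` and `is_supported_platform` are introduced here for illustration and are not part of HiTyper.

```python
import sys

MIN_VERSION = (3, 9)  # taken from the requirements list above


def meets_python_requirement(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is at least 3.9."""
    return tuple(version_info[:2]) >= MIN_VERSION


def is_supported_platform(platform: str = sys.platform) -> bool:
    """Return True on Linux, the only platform listed in the requirements."""
    return platform.startswith("linux")


if not meets_python_requirement():
    print("HiTyper needs Python >= 3.9; found", sys.version.split()[0])
if not is_supported_platform():
    print("HiTyper lists Linux as a requirement; found", sys.platform)
```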

## Usage

HiTyper currently provides the following commands. Some important settings are stored in `config.py`; you may need to modify it before running HiTyper.

### findusertype

```sh
usage: hityper findusertype [-h] [-s SOURCE] [-p REPO] [-g GROUNDTRUTH] [-c CORE] [-v] [-d OUTPUT_DIRECTORY]

optional arguments:
  -h, --help            show this help message and exit
  -s SOURCE, --source SOURCE
                        Path to a Python source file
  -p REPO, --repo REPO  Path to a Python project
  -g GROUNDTRUTH, --groundtruth GROUNDTRUTH
                        Path to a ground truth file
  -c CORE, --core CORE  Number of cores to use when collecting user-defined types
  -v, --validate        Validate the imported user-defined types by finding their implementations
  -d OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to the store the usertypes
```

**Example of collecting user-defined types in source files:**

```sh
hityper findusertype -s python_project_repo/test.py -p python_project_repo -v -d outputs
```

*This command collects the user-defined types found by HiTyper and saves them as `.json` files under the `outputs/` folder.*

The `-p` option is required here. If you do not specify `-s`, HiTyper collects user-defined types from all files in the repo specified by `-p`.

**[Newly Added 6 Aug]**

We added an option to automatically generate all the user-defined type files needed to evaluate HiTyper on a ground truth dataset.

**Example of collecting user-defined types in groundtruth datasets:**

```sh
hityper findusertype -g groundtruth.json -p repo_prefix -c 60 -d outputs
```

*This command collects the user-defined types in the files indicated by `groundtruth.json` and saves them as `.json` files under the `outputs/` folder.*

For `groundtruth.json`, use the same file that you pass to the `hityper eval` command, or generate it with the `hityper preprocess` command.

`-p repo_prefix` is optional here. If the filenames in `groundtruth.json` are absolute paths, you do not need to specify `-p`; otherwise, use `-p` to indicate the folder in which the source files are stored.

Collecting all user-defined types for a large dataset is quite slow, so specify a large number of cores with `-c` to speed up the process.
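One way to pick the `-c` value is to ask the operating system how many cores are available. The sketch below only assembles the argument list for the command shown above and prints it; `findusertype_command` is a hypothetical helper, and defaulting to every available core is an assumption that may not suit a shared machine.

```python
import os
import shlex
from typing import List, Optional


def findusertype_command(
    groundtruth: str,
    output_dir: str,
    repo_prefix: Optional[str] = None,
    cores: Optional[int] = None,
) -> List[str]:
    """Build the `hityper findusertype -g ...` argument list (hypothetical helper)."""
    if cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    cmd = ["hityper", "findusertype", "-g", groundtruth, "-c", str(cores), "-d", output_dir]
    # -p is only needed when groundtruth.json stores relative filenames.
    if repo_prefix is not None:
        cmd += ["-p", repo_prefix]
    return cmd


print(shlex.join(findusertype_command("groundtruth.json", "outputs", repo_prefix="repo_prefix")))
```

The returned list can be passed to `subprocess.run` once HiTyper is installed.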

### gentdg

```sh
hityper gentdg [-h] [-s SOURCE] -p REPO [-o] [-l LOCATION] [-a] [-c] [-d OUTPUT_DIRECTORY] [-f {json,pdf}]

optional arguments:
  -h, --help            show this help message and exit
  -s SOURCE, --source SOURCE
                        Path to a Python source file
  -p REPO, --repo REPO  Path to a Python project
  -o, --optimize        Remove redundant nodes in TDG
  -l LOCATION, --location LOCATION
                        Generate TDG for a specific function
  -a, --alias_analysis  Generate alias graphs along with TDG
  -c, --call_analysis   Generate call graphs along with TDG
  -d OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to the generated TDGs
  -f {json,pdf}, --output_format {json,pdf}
                        Formats of output TDGs
```

**Example:**

```sh
hityper gentdg -s python_project_repo/test.py -p python_project_repo -d outputs -f json -o
```

*This command generates the TDGs for all functions in the file `python_project_repo/test.py` and saves them into the `outputs` folder.*

Note that if you choose the `json` format, HiTyper writes only ONE `json` file that contains all TDGs in the source file. If you choose the `pdf` format, it writes multiple `pdf` files, each corresponding to one function in the source file, because a single pdf file can hardly hold the large TDGs of every function.

For the location indicated by `-l`, use the format `funcname@classname` and use `global` as the classname if the function is a global function.
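The `funcname@classname` convention is easy to get wrong when scripting, so a small helper can build and check these strings. This is a sketch of the format as described above; `make_location` and `split_location` are hypothetical names, not functions shipped with HiTyper.

```python
from typing import Optional, Tuple


def make_location(funcname: str, classname: Optional[str] = None) -> str:
    """Format a value for -l, using `global` for functions outside any class."""
    return f"{funcname}@{classname if classname else 'global'}"


def split_location(location: str) -> Tuple[str, str]:
    """Split a `funcname@classname` string back into its two parts."""
    funcname, sep, classname = location.rpartition("@")
    if not sep or not funcname or not classname:
        raise ValueError(f"expected funcname@classname, got {location!r}")
    return funcname, classname


print(make_location("parse"))
print(make_location("run", "Worker"))
```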

HiTyper uses [PyCG](https://github.com/vitsalis/PyCG) to build call graphs in call analysis. Alias analysis and call analysis are built in for now, but HiTyper does not yet use them in inference. Future updates are expected to integrate them.

### infer

```sh
hityper infer [-h] [-s SOURCE] -p REPO [-l LOCATION] [-d OUTPUT_DIRECTORY] [-m RECOMMENDATIONS] [-t] [-n TOPN]

optional arguments:
  -h, --help            show this help message and exit
  -s SOURCE, --source SOURCE
                        Path to a Python source file
  -p REPO, --repo REPO  Path to a Python project
  -l LOCATION, --location LOCATION
                        Type inference for a specific function
  -d OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to the generated TDGs
  -m RECOMMENDATIONS, --recommendations RECOMMENDATIONS
                        Path to the recommendations generated by a DL model
  -t, --type4py         Use Type4Py as the recommendation model
  -n TOPN, --topn TOPN  Indicate the top n predictions from DL models used by HiTyper
```

**Example:**

```sh
hityper infer -s python_project_repo/test.py -p python_project_repo -d outputs -n 1 -t 
```

*This command infers types for all variables, arguments and return values in the source file and saves them into the `outputs` folder.*

If you specify neither `-m` nor `-t`, HiTyper uses only its static inference part to infer types. Static inference generally takes several minutes.

For the location indicated by `-l`, use the format `funcname@classname` and use `global` as the classname if the function is a global function.
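When `hityper infer` is driven from a script, the choice between static-only inference and a recommendation model comes down to which optional flags are passed. The sketch below assembles the argument list from the options documented in the usage text above; `infer_command` is a hypothetical helper and does not run HiTyper itself.

```python
import shlex
from typing import List, Optional


def infer_command(
    repo: str,
    source: Optional[str] = None,
    output_dir: Optional[str] = None,
    location: Optional[str] = None,
    recommendations: Optional[str] = None,
    use_type4py: bool = False,
    topn: Optional[int] = None,
) -> List[str]:
    """Build the `hityper infer` argument list (hypothetical helper)."""
    cmd = ["hityper", "infer", "-p", repo]  # -p is the only required option
    if source is not None:
        cmd += ["-s", source]
    if location is not None:
        cmd += ["-l", location]
    if output_dir is not None:
        cmd += ["-d", output_dir]
    # Omitting both -m and -t leaves HiTyper in static-inference-only mode.
    if recommendations is not None:
        cmd += ["-m", recommendations]
    if use_type4py:
        cmd.append("-t")
    if topn is not None:
        cmd += ["-n", str(topn)]
    return cmd


# Static inference only, restricted to one global function.
print(shlex.join(infer_command("python_project_repo", source="python_project_repo/test.py",
                               output_dir="outputs", location="main@global")))
```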

**Recommendation Model:**

HiTyper natively supports recommendations from Type4Py. If you use the `-t` option, it invokes the following API provided by Type4Py to get recommendations:

```
https://type4py.com/api/predict?tc=0
```

**This will upload your file to the Type4Py server!** If you do not want to upload your file, you can use the [docker](https://github.com/saltudelft/type4py/wiki/Using-Type4Py-Rest-API) image provided by Type4Py and change the API in `config.py` to:

```
http://localhost:PORT/api/predict?tc=0
```
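The two endpoints differ only in scheme, host, and port, so the value to place in `config.py` can be derived from a single port number. The helper below is a sketch: `type4py_endpoint` is a hypothetical name, the port `5001` in the usage line is a placeholder for whatever port your container exposes, and the name of the setting inside `config.py` is not shown here because the text above does not state it.

```python
from typing import Optional

# Public endpoint quoted above; using it uploads the analyzed file.
PUBLIC_API = "https://type4py.com/api/predict?tc=0"


def type4py_endpoint(local_port: Optional[int] = None) -> str:
    """Return the public API URL, or the local one when a port is given."""
    if local_port is None:
        return PUBLIC_API
    if not 0 < local_port < 65536:
        raise ValueError(f"invalid TCP port: {local_port}")
    # Same path and query string as the public API, served from localhost.
    return f"http://localhost:{local_port}/api/predict?tc=0"


print(type4py_endpoint(5001))  # placeholder port
```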

According to our experiments, the Type4Py model performs much worse when queried through the API above. We suggest training the model locally and generating a recommendation file, which can be passed to `-m`.

**Note: HiTyper's performance depends heavily on the maximum performance of the recommendation model (especially its ability to predict argument types). Type inference can fail if the recommendation model cannot give a valid prediction and static inference does not work either!**

If you want to use another, more powerful model, you can write code similar to `__main__.py` to adapt HiTyper to your DL model.

### eval

```sh
hityper eval [-h] -g GROUNDTRUTH -c CLASSIFIED_GROUNDTRUTH -u USERTYPE [-m RECOMMENDATIONS] [-t] [-n TOPN]

optional arguments:
  -h, --help            show this help message and exit
  -g GROUNDTRUTH, --groundtruth GROUNDTRUTH
                        Path to a ground truth dataset
  -c CLASSIFIED_GROUNDTRUTH, --classified_groundtruth CLASSIFIED_GROUNDTRUTH
                        Path to a classified ground truth dataset
  -u USERTYPE, --usertype USERTYPE
                        Path to a previously collected user-defined type set
  -m RECOMMENDATIONS, --recommendations RECOMMENDATIONS
                        Path to the recommendations generated by a DL model
  -t, --type4py         Use Type4Py as the recommendation model
  -n TOPN, --topn TOPN  Indicate the top n predictions from DL models used by HiTyper
```

**Example:**

```sh
hityper eval -g groundtruth.json -c detailed_groundtruth.json -u usertypes.json -n 1 -t
```

*This command evaluates the performance of HiTyper on a pre-defined ground truth dataset. It outputs results similar to those in the `Experiment Results` section.*

Before evaluating HiTyper with this command, use the `hityper findusertype` command to generate `usertypes.json`. This typically takes several hours, depending on the number of files.

This command is intended only for research evaluation.

### preprocess

```sh
usage: hityper preprocess [-h] -p JSON_REPO [-d OUTPUT_DIRECTORY]

optional arguments:
  -h, --help            show this help message and exit
  -p JSON_REPO, --json_repo JSON_REPO
                        Path to the repo of JSON files
  -d OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to the transformed datasets
```

**Example:**

```sh
hityper preprocess -p ManyTypes4PyDataset/processed_projects_complete -d outputs
```

*This command transforms the JSON files in the ManyTypes4Py dataset into the `groundtruth.json` and `detailed_groundtruth.json` files required by the `hityper eval` command.*

This command helps researchers who use the ManyTypes4Py dataset and want to evaluate HiTyper on it.

If you want to run HiTyper on other datasets, write a script that follows the same logic as the `transformDataset` function in `HiTyper/hityper/utils.py`.
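Taken together, the `preprocess`, `findusertype`, and `eval` commands form a three-step evaluation pipeline. The sketch below lists the steps in order using only the flags documented above. It assumes that the generated `groundtruth.json`, `detailed_groundtruth.json`, and `usertypes.json` all end up directly inside the chosen output folder, which you should verify against your own run; `evaluation_pipeline` is a hypothetical helper.

```python
import shlex
from typing import List


def evaluation_pipeline(json_repo: str, repo_prefix: str,
                        work_dir: str = "outputs", cores: int = 1) -> List[List[str]]:
    """Return the three commands of an evaluation run, in order (hypothetical helper)."""
    # Assumed output locations; check where your HiTyper version writes them.
    groundtruth = f"{work_dir}/groundtruth.json"
    detailed = f"{work_dir}/detailed_groundtruth.json"
    usertypes = f"{work_dir}/usertypes.json"
    return [
        # 1. Convert the ManyTypes4Py JSON files into ground truth files.
        ["hityper", "preprocess", "-p", json_repo, "-d", work_dir],
        # 2. Collect the user-defined types that the ground truth refers to.
        ["hityper", "findusertype", "-g", groundtruth, "-p", repo_prefix,
         "-c", str(cores), "-d", work_dir],
        # 3. Evaluate with static inference only (no -m or -t).
        ["hityper", "eval", "-g", groundtruth, "-c", detailed, "-u", usertypes],
    ]


for step in evaluation_pipeline("ManyTypes4PyDataset/processed_projects_complete", "repo_prefix", cores=60):
    print(shlex.join(step))
```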

## Experiment Results

**Dataset:**

The following results are evaluated using the [ManyTypes4Py](https://zenodo.org/record/4719447#.YjxcpBNBxb8) dataset. 

Since the original dataset does not contain Python source files, we also provide a [link](https://drive.google.com/file/d/1HdZyd3dKAUkiv2Nl0Zynp_YhrqU6HfMx/view?usp=sharing) to the Python source files that HiTyper uses to infer types, to facilitate future research. The attached dataset is not identical to the original one because the original contains some GitHub repos that do not allow open access or have been deleted.

Note that, as stated in the paper, there are a few cases (such as subtypes and identical types with different names) in which HiTyper's result should be correct but is still counted as wrong in the evaluation.

**Metrics:**

For the definitions of the metrics used here, please refer to the paper. These metrics can be regarded as a kind of "recall" that evaluates the coverage of HiTyper on a specific dataset. We do not report "precision" here because HiTyper outputs a result only when it observes no violation of the current typing rules and TDG.
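To see why such a metric behaves like recall, consider a simplified exact-match rate in which a slot with no prediction counts as a miss. This is an illustration only: the authoritative definitions are in the paper, and `exact_match_rate` and the slot names below are hypothetical.

```python
from typing import Dict, Optional


def exact_match_rate(groundtruth: Dict[str, str],
                     predictions: Dict[str, Optional[str]]) -> float:
    """Fraction of ground truth slots whose predicted type matches exactly."""
    if not groundtruth:
        return 0.0
    # A missing or None prediction never equals the expected type,
    # so unanswered slots lower the score just like wrong answers do.
    hits = sum(1 for slot, expected in groundtruth.items()
               if predictions.get(slot) == expected)
    return hits / len(groundtruth)


# Hypothetical slots: two correct, one wrong, one without any prediction.
truth = {"f.arg0": "int", "f.return": "str", "g.x": "list", "g.return": "bool"}
preds = {"f.arg0": "int", "f.return": "str", "g.x": "dict"}
print(exact_match_rate(truth, preds))
```

Because the denominator is the full ground truth rather than the number of emitted predictions, the score measures coverage of the dataset.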

**Only using the static inference part:**

| Category           | Exact Match | Match to Parametric | Partial Match |
| ------------------ | ----------- | ------------------- | ------------- |
| Simple Types       | 59.00%      | 59.47%              | 62.15%        |
| Generic Types      | 55.50%      | 69.68%              | 71.90%        |
| User-defined Types | 40.40%      | 40.40%              | 44.30%        |
| Arguments          | 7.65%       | 8.05%               | 14.39%        |
| Return Values      | 58.71%      | 64.61%              | 69.06%        |
| Local Variables    | 61.56%      | 65.66%              | 67.05%        |

You can use the following command to reproduce the above results:

```sh
hityper eval -g ManyTypes4Py_gts_test_verified.json -c ManyTypes4Py_gts_test_verified_detailed.json -u ManyTypes4Py_test_usertypes.json 
```

We do not report the performance of HiTyper combined with different DL models here, since many factors (datasets, hyper-parameters, etc.) affect the performance of DL models. Please align the performance yourself before using recommendations from DL models.

We are also working on a DL model that is more suitable for HiTyper. Stay tuned!

**Other datasets:**

If you want to evaluate HiTyper on other datasets, generate files in the same format as `ManyTypes4Py_gts_test_verified.json` and `ManyTypes4Py_gts_test_verified_detailed.json`, or modify the code in `__main__.py`. To check a type's category, you can use `hityper.typeobject.TypeObject.checkType()`.

In any case, you must also prepare the source files for static analysis.

**Old results:**

If you want the exact experiment results stated in the paper, please download them at this [link](https://drive.google.com/file/d/1zFVStp085bfv8WU7UCk9pIE2HEEf-CUh/view?usp=sharing).

## Todo

- Add support for inter-procedural analysis
- Add support for types from third-party modules
- Add support for external function calls
- Add support for stub files

## Cite Us

If you use HiTyper in your research, please cite us:

```latex
@inproceedings{peng22hityper,
author = {Peng, Yun and Gao, Cuiyun and Li, Zongjie and Gao, Bowei and Lo, David and Zhang, Qirun and Lyu, Michael},
title = {Static Inference Meets Deep Learning: A Hybrid Type Inference Approach for Python},
year = {2022},
isbn = {9781450392211},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3510003.3510038},
doi = {10.1145/3510003.3510038},
booktitle = {Proceedings of the 44th International Conference on Software Engineering},
pages = {2019–2030},
numpages = {12},
location = {Pittsburgh, Pennsylvania},
series = {ICSE '22}
}
```

## Contact

We actively maintain this project and welcome contributions. 

If you have any questions, please contact research@yunpeng.work.
