[![PyPI version](https://badge.fury.io/py/nleval.svg)](https://badge.fury.io/py/nleval)
[![Documentation Status](https://readthedocs.org/projects/networklearningeval/badge/?version=latest)](https://networklearningeval.readthedocs.io/en/latest/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Tests](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/tests.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/tests.yml)
[![Test Examples](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml)
[![Test Data](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml)
# NetworkLearningEval
## Installation
Clone the repository first and then install via `pip`:
```bash
git clone https://github.com/krishnanlab/NetworkLearningEval && cd NetworkLearningEval
pip install -e .
```
The `-e` option means 'editable', i.e., there is no need to reinstall the library after you change the source code.
If you do not plan on modifying the source code, omit the `-e` option and simply run `pip install .`.
### Optional PyTorch Geometric installation
Users need to install [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) (PyG) to enable some GNN-related features.
To install PyG, first install [PyTorch](https://pytorch.org).
For full installation instructions, visit the links above.
Assuming the system has Python 3.8 or above and CUDA 10.2 installed, use the following to install both PyTorch and PyG.
```bash
conda install pytorch=1.12.1 torchvision cudatoolkit=10.2 -c pytorch
pip install torch-geometric==2.0.4 torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.12.1+cu102.html
```
### Quick install using the installation script
```bash
source install.sh cu102 # other options are [cpu,cu113]
```
## Quick Demonstration
### Construct default datasets
We provide a high-level dataset constructor that helps users set up an ML-ready dataset
for a given combination of network and label. The dataset is set up with a study-bias
holdout split (6/2/2): the 60% most well-studied genes, ranked by the number of
associated PubMed publications, are used for training, the 20% least-studied genes are used for
testing, and the remaining 20% of genes are used for validation. For more customizable data loading
and processing options, see the [customized dataset construction](#customized-dataset-construction)
section below.
```python
from nleval.util.dataset_constructors import default_constructor
root = "datasets" # save dataset and cache under the datasets/ directory
version = "nledata-v0.1.0-dev3" # archive data version, use 'latest' to pull latest data from source instead
# Download and process network/label data. Use the adjacency matrix as the ML feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET",
                              graph_as_feature=True, use_dense_graph=True)
```
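The study-bias holdout split described above can be illustrated with a small standalone sketch. It does not use `nleval`; the function name `study_bias_split` and the toy publication counts are hypothetical, and the library's own splitter may handle ties and rounding differently.

```python
def study_bias_split(pubmed_counts, train_ratio=0.6, val_ratio=0.2):
    """Split genes into train/val/test by descending publication count.

    Illustrative only: not the nleval implementation.
    """
    # Most studied genes first; ties are broken by gene ID for determinism
    ranked = sorted(pubmed_counts, key=lambda g: (-pubmed_counts[g], g))
    n_train = round(len(ranked) * train_ratio)
    n_val = round(len(ranked) * val_ratio)
    train = ranked[:n_train]                # most well-studied genes
    val = ranked[n_train:n_train + n_val]   # genes in between
    test = ranked[n_train + n_val:]         # least studied genes
    return train, val, test


# Hypothetical publication counts for ten genes (gene0 is the most studied)
counts = {f"gene{i}": 100 - 10 * i for i in range(10)}
train, val, test = study_bias_split(counts)
```

With ten genes, this assigns the six most studied genes to training, the next two to validation, and the two least studied genes to testing.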
### Evaluating standard models
Simple machine learning methods, such as logistic regression and label propagation,
can be evaluated using the trainer objects. Each trainer takes a dictionary of metrics
used to evaluate model performance, and can be set up as follows.
```python
from nleval.metric import auroc
from nleval.model_trainer import SupervisedLearningTrainer, LabelPropagationTrainer
metrics = {"auroc": auroc} # use AUROC as our default evaluation metric
sl_trainer = SupervisedLearningTrainer(metrics)
lp_trainer = LabelPropagationTrainer(metrics)
```
Then, use the `eval_multi_ovr` method of the trainer to evaluate a given ML model over all tasks
in a one-vs-rest setting.
```python
from sklearn.linear_model import LogisticRegression
from nleval.model.label_propagation import OneHopPropagation
# Initialize models
sl_mdl = LogisticRegression(penalty="l2", solver="lbfgs")
lp_mdl = OneHopPropagation()
# Evaluate the models over all tasks
sl_results = sl_trainer.eval_multi_ovr(sl_mdl, dataset)
lp_results = lp_trainer.eval_multi_ovr(lp_mdl, dataset)
```
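To make the one-vs-rest evaluation more concrete, here is a minimal standalone sketch that does not depend on `nleval`. It assumes that a metric is a callable taking true labels and predicted scores, and it uses a simple neighbor-count rule as a stand-in for one-hop propagation. The names `auroc_sketch`, `one_hop_scores`, and `eval_multi_ovr_sketch` are hypothetical, and the library's actual implementations may differ.

```python
def auroc_sketch(y_true, y_score):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a correct pair
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_hop_scores(adjacency, labels):
    """Score each node by the total edge weight to its labeled neighbors."""
    return [sum(w * labels[j] for j, w in enumerate(row)) for row in adjacency]


def eval_multi_ovr_sketch(adjacency, label_matrix, train_idx, test_idx, metrics):
    """Evaluate every label column as a separate binary task."""
    results = {name: [] for name in metrics}
    for task in range(len(label_matrix[0])):
        y = [row[task] for row in label_matrix]
        # Only training labels are visible to the propagation step
        visible = [y[i] if i in train_idx else 0 for i in range(len(y))]
        scores = one_hop_scores(adjacency, visible)
        y_test = [y[i] for i in test_idx]
        s_test = [scores[i] for i in test_idx]
        for name, fn in metrics.items():
            results[name].append(fn(y_test, s_test))
    return results


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Two tasks: task 0 labels the first triangle, task 1 labels the second
label_matrix = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
results = eval_multi_ovr_sketch(
    adjacency, label_matrix, train_idx={0, 1, 4, 5}, test_idx=[2, 3],
    metrics={"auroc": auroc_sketch},
)
```

The result maps each metric name to a list with one score per task, which mirrors the idea of evaluating a single model over all tasks independently.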
### Evaluating GNN models
Graph Neural Network (GNN) models can be trained and evaluated in a similar fashion. This requires the optional PyTorch Geometric installation described above.
```python
from torch_geometric.nn import GCN
from nleval.model_trainer.gnn import SimpleGNNTrainer
# Use 1-dimensional trivial node feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET")
# Train and evaluate a GCN.
# NOTE: n_tasks (the number of label sets, i.e., prediction tasks, in the dataset)
# is not defined in this snippet and must be set before constructing the model.
gcn_mdl = GCN(in_channels=1, hidden_channels=64, num_layers=5, out_channels=n_tasks)
gcn_trainer = SimpleGNNTrainer(metrics, device="cuda", metric_best="auroc")
gcn_results = gcn_trainer.train(gcn_mdl, dataset)
```
### Customized dataset construction
#### Load network and labels
```python
from nleval import data
root = "datasets" # save dataset and cache under the datasets/ directory
# Load processed BioGRID data from archive.
# Alternatively, set version="latest" to get and process the newest data from scratch.
g = data.BioGRID(root, version="nledata-v0.1.0-dev3")
# Load DisGeNET gene set collections.
lsc = data.DisGeNET(root, version="latest")
```
#### Setting up data and splits
```python
from nleval.util.converter import GenePropertyConverter
from nleval.label.split import RatioHoldout
# Load PubMed count gene property converter and use it to set up study-bias holdout split
pubmedcnt_converter = GenePropertyConverter(root, name="PubMedCount")
splitter = RatioHoldout(0.6, 0.4, ascending=False, property_converter=pubmedcnt_converter)
```
#### Filter labeled data based on network genes and splits
```python
from nleval.label import filters

# Apply in-place filters to the labelset collection
lsc.iapply(
    filters.Compose(
        # Only use genes that are present in the network
        filters.EntityExistenceFilter(list(g.node_ids)),
        # Remove any labelsets with fewer than 50 network genes
        filters.LabelsetRangeFilterSize(min_val=50),
        # Make sure each split has at least 10 positive examples
        filters.LabelsetRangeFilterSplit(min_val=10, splitter=splitter),
    ),
)
```
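The effect of the three filters can be sketched in plain Python, independent of `nleval`. The function `filter_labelsets` and the toy gene sets below are hypothetical and only mirror the logic stated in the comments above (restrict to network genes, enforce a minimum size, enforce a minimum number of positives per split); the library's filters may differ in detail.

```python
def filter_labelsets(labelsets, network_genes, min_size, splits, min_per_split):
    """Keep label sets that are large enough overall and within every split."""
    kept = {}
    for name, genes in labelsets.items():
        # Only use genes that are present in the network
        genes = genes & network_genes
        # Drop label sets with too few network genes
        if len(genes) < min_size:
            continue
        # Require enough positive examples in each split
        if all(len(genes & split) >= min_per_split for split in splits):
            kept[name] = genes
    return kept


network_genes = {"a", "b", "c", "d", "e", "f"}
splits = [{"a", "b", "c"}, {"d", "e", "f"}]  # toy train/test partitions
labelsets = {
    "disease1": {"a", "b", "d", "e", "x"},  # "x" is not in the network
    "disease2": {"a", "b", "c", "d"},       # only one positive in the second split
    "disease3": {"a", "d"},                 # too small
}
kept = filter_labelsets(
    labelsets, network_genes, min_size=4, splits=splits, min_per_split=2
)
```

Only `disease1` survives in this toy example, with the out-of-network gene `x` removed.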
#### Combine into dataset
```python
from nleval import Dataset
dataset = Dataset(graph=g, feature=g.to_dense_graph().to_feature(), label=lsc, splitter=splitter)
```
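Using the dense graph as the feature means that each gene is represented by its row of the adjacency matrix. A rough standalone sketch of that idea follows; the helper `dense_adjacency` and the toy edge list are hypothetical, the graph is assumed to be undirected, and this is not how `to_dense_graph().to_feature()` is necessarily implemented.

```python
def dense_adjacency(node_ids, edges):
    """Build a dense, symmetric adjacency matrix from a weighted edge list."""
    index = {node: i for i, node in enumerate(node_ids)}
    matrix = [[0.0] * len(node_ids) for _ in node_ids]
    for src, dst, weight in edges:
        i, j = index[src], index[dst]
        matrix[i][j] = weight
        matrix[j][i] = weight  # undirected graph: mirror the edge
    return matrix


node_ids = ["g1", "g2", "g3"]
edges = [("g1", "g2", 1.0), ("g2", "g3", 0.5)]
features = dense_adjacency(node_ids, edges)
# features[i] is the feature vector of node_ids[i]
```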
## Data preparation and releasing notes
First, bump the data version in `__init__.py` to the next data release version, e.g., `nledata-v0.1.0 -> nledata-v0.1.1-dev`.
Then, download and process all of the latest data by running
```bash
python script/release_data.py
```
By default, the data ready to be uploaded (e.g., to [Zenodo](https://zenodo.org)) is saved under `data_release/archived`.
After inspecting the archive and confirming that everything looks good, upload and publish the new archived data.
**Note:** `dev` data should be uploaded to the [sandbox](https://sandbox.zenodo.org/record/1097545#.YxYrqezMJzV) instead.
Check items:
- [ ] Update `__data_version__`
- [ ] Run [`release_data.py`](script/release_data.py)
- [ ] Upload archived data to Zenodo (be sure to edit the data version there also)
- [ ] Update the URL dictionary in the config (to be improved in the future by getting the info from Zenodo directly)
- [ ] Update network stats in data [test](test/test_data.py)
Finally, commit and push the bumped version.
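
The version bump in the first step follows the pattern shown in the example above (`nledata-v0.1.0 -> nledata-v0.1.1-dev`). A small helper along these lines could compute the next string; `next_dev_version` is a hypothetical name, is not part of the repository, and only handles plain release versions without a `-dev` suffix.

```python
import re


def next_dev_version(version):
    """Return the next patch-level dev version for a plain release version."""
    match = re.fullmatch(r"nledata-v(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unsupported data version string: {version!r}")
    major, minor, patch = map(int, match.groups())
    # Increment the patch number and mark the result as a dev release
    return f"nledata-v{major}.{minor}.{patch + 1}-dev"


bumped = next_dev_version("nledata-v0.1.0")
```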
Raw data
{
"_id": null,
"home_page": "",
"name": "nleval",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "Data Processing,Gene Classification,Machine Learning,Network Biology,Network Repositories",
"author": "",
"author_email": "Remy Liu <liurenmi@msu.edu>",
"download_url": "https://files.pythonhosted.org/packages/f1/c1/d00dbdf46120b4670d27028ff96a140f6001e83b3f00b8b5d28fd640a0d9/nleval-0.1.0.dev8.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python toolkit for biological network learning evaluation",
"version": "0.1.0.dev8",
"project_urls": {
"bug-tracker": "https://github.com/krishnanlab/NetworkLearningEval/issues",
"home": "https://github.com/krishnanlab/NetworkLearningEval"
},
"split_keywords": [
"data processing",
"gene classification",
"machine learning",
"network biology",
"network repositories"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "83da06482a00dbbd241f0d6123d48d7b9a01cfc5b6643ba8f37486277bef45b4",
"md5": "16ff3810ffa0af01b00fda6d0a984ca7",
"sha256": "0029fa53e72609b7c211c8f97ba3ea1c13508d7edf39c43f7efffe06d0a0b2e7"
},
"downloads": -1,
"filename": "nleval-0.1.0.dev8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "16ff3810ffa0af01b00fda6d0a984ca7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 141340,
"upload_time": "2023-06-08T15:53:58",
"upload_time_iso_8601": "2023-06-08T15:53:58.337192Z",
"url": "https://files.pythonhosted.org/packages/83/da/06482a00dbbd241f0d6123d48d7b9a01cfc5b6643ba8f37486277bef45b4/nleval-0.1.0.dev8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f1c1d00dbdf46120b4670d27028ff96a140f6001e83b3f00b8b5d28fd640a0d9",
"md5": "6e221bb4bd9fbdf1c5460150e6ecfc2c",
"sha256": "212397897681cbf864992a88d92f3b4fc2cd5fc72ff54ec7f68214867825c39c"
},
"downloads": -1,
"filename": "nleval-0.1.0.dev8.tar.gz",
"has_sig": false,
"md5_digest": "6e221bb4bd9fbdf1c5460150e6ecfc2c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 115172,
"upload_time": "2023-06-08T15:54:00",
"upload_time_iso_8601": "2023-06-08T15:54:00.738726Z",
"url": "https://files.pythonhosted.org/packages/f1/c1/d00dbdf46120b4670d27028ff96a140f6001e83b3f00b8b5d28fd640a0d9/nleval-0.1.0.dev8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-08 15:54:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "krishnanlab",
"github_project": "NetworkLearningEval",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"tox": true,
"lcname": "nleval"
}