[![PyPI version](https://badge.fury.io/py/nleval.svg)](https://badge.fury.io/py/nleval)
[![Documentation Status](https://readthedocs.org/projects/networklearningeval/badge/?version=latest)](https://networklearningeval.readthedocs.io/en/latest/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Test Examples](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml)
[![Test Data](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml)
# NetworkLearningEval
## Installation
Clone the repository, then install the package via `pip`:
```bash
git clone https://github.com/krishnanlab/NetworkLearningEval && cd NetworkLearningEval
pip install -e .
```
The `-e` option installs the package in 'editable' mode, so you do not need to reinstall the library after changing the source code.
If you do not plan to modify the source code, omit `-e` and simply run `pip install .`.
### Optional PyTorch Geometric installation
Users need to install [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) (PyG) to enable some GNN-related features.
PyG depends on [PyTorch](https://pytorch.org), which must be installed first.
For full installation instructions, visit the links above.
Assuming the system has Python 3.8 or above and CUDA 10.2, use the following to install both PyTorch and PyG.
```bash
conda install pytorch=1.12.1 torchvision cudatoolkit=10.2 -c pytorch
pip install torch-geometric==2.0.4 torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.12.1+cu102.html
```
### Quick install using the installation script
```bash
source install.sh cu102  # other options are [cpu,cu113]
```
## Quick Demonstration
### Construct default datasets
We provide a high-level dataset constructor to help users effortlessly set up an ML-ready dataset
for a combination of network and label. In particular, the dataset will be set up with a study-bias
holdout split (6/2/2): the 60% most well-studied genes, ranked by the number of
associated PubMed publications, are used for training, the 20% least-studied genes are used for
testing, and the remaining 20% of genes are used for validation. For more customizable data loading
and processing options, see the [customized dataset construction](#customized-dataset-construction)
section below.
```python
from nleval.util.dataset_constructors import default_constructor

root = "datasets"  # save dataset and cache under the datasets/ directory
version = "nledata-v0.1.0-dev3"  # archive data version, use 'latest' to pull latest data from source instead

# Download and process network/label data. Use the adjacency matrix as the ML feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET",
                              graph_as_feature=True, use_dense_graph=True)
```
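To make the 6/2/2 study-bias split concrete, here is a minimal sketch in plain Python. It is not nleval's implementation: the helper name `study_bias_split`, the tie-breaking rule, and the toy publication counts are all assumptions made for illustration.

```python
def study_bias_split(pub_counts, ratios=(0.6, 0.2, 0.2)):
    """Split genes into train/val/test by descending publication count.

    Hypothetical helper for illustration only; not part of nleval.
    """
    # Most-studied genes first; ties are broken by gene ID (an assumption)
    ranked = sorted(pub_counts, key=lambda g: (-pub_counts[g], g))
    n_train = round(len(ranked) * ratios[0])
    n_val = round(len(ranked) * ratios[1])
    train = ranked[:n_train]                 # most-studied genes
    val = ranked[n_train:n_train + n_val]    # genes in between
    test = ranked[n_train + n_val:]          # least-studied genes
    return train, val, test


# Toy PubMed counts (made-up numbers)
pub_counts = {
    "TP53": 9000, "BRCA1": 5000, "EGFR": 4000, "MYC": 3000, "AKT1": 2000,
    "KRAS": 1500, "GENE_G": 40, "GENE_H": 30, "GENE_I": 5, "GENE_J": 1,
}
train, val, test = study_bias_split(pub_counts)
print(train, val, test)
```

With ten genes, the six most-studied ones land in the training set and the two least-studied ones in the test set, so a model is always tested on genes that are less characterized than those it was trained on.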
### Evaluating standard models
Evaluation of simple machine learning methods such as logistic regression and label propagation
can be done easily using the trainer objects. The trainer objects take a dictionary of metrics
as input for evaluating the models' performances, and can be set up as follows.
```python
from nleval.metric import auroc
from nleval.model_trainer import SupervisedLearningTrainer, LabelPropagationTrainer

metrics = {"auroc": auroc}  # use AUROC as our default evaluation metric
sl_trainer = SupervisedLearningTrainer(metrics)
lp_trainer = LabelPropagationTrainer(metrics)
```
Then, use the `eval_multi_ovr` method of the trainer to evaluate a given ML model over all tasks
in a one-vs-rest setting.
```python
from sklearn.linear_model import LogisticRegression
from nleval.model.label_propagation import OneHopPropagation

# Initialize models
sl_mdl = LogisticRegression(penalty="l2", solver="lbfgs")
lp_mdl = OneHopPropagation()

# Evaluate the models over all tasks
sl_results = sl_trainer.eval_multi_ovr(sl_mdl, dataset)
lp_results = lp_trainer.eval_multi_ovr(lp_mdl, dataset)
```
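For intuition about what a one-vs-rest evaluation with a metrics dictionary involves, the following plain-Python sketch scores each task independently with a simple one-hop propagation and computes AUROC on held-out nodes. It is an illustrative approximation, not the library's code: the helpers `auroc`, `one_hop_scores`, and `eval_ovr`, the result key format, and the toy graph are all hypothetical.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_hop_scores(adj, y, train_idx):
    """Score each node by its total edge weight to positive training nodes."""
    train_pos = [j for j in train_idx if y[j] == 1]
    return [sum(row[j] for j in train_pos) for row in adj]


def eval_ovr(adj, label_matrix, train_idx, test_idx, metrics):
    """Evaluate each task (label column) separately: its positives vs. the rest."""
    results = {}
    for task in range(len(label_matrix[0])):
        y = [row[task] for row in label_matrix]
        scores = one_hop_scores(adj, y, train_idx)
        for name, fn in metrics.items():
            results[f"task{task}_{name}"] = fn(
                [y[i] for i in test_idx], [scores[i] for i in test_idx]
            )
    return results


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Task 0 labels the first triangle, task 1 labels the second
label_matrix = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
results = eval_ovr(adj, label_matrix, train_idx=[0, 1, 4, 5], test_idx=[2, 3],
                   metrics={"auroc": auroc})
print(results)
```

Passing metrics as a name-to-function dictionary, as the trainers above do, makes it easy to report several metrics per task without changing the evaluation loop.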
### Evaluating GNN models
Training and evaluation of Graph Neural Network (GNN) models can be done in a very similar fashion.
```python
from torch_geometric.nn import GCN
from nleval.model_trainer.gnn import SimpleGNNTrainer

# Use 1-dimensional trivial node feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET")

# Train and evaluate a GCN
# NOTE: `n_tasks` (the number of tasks, i.e. the output dimension) is not defined
# in this snippet and must be set beforehand.
gcn_mdl = GCN(in_channels=1, hidden_channels=64, num_layers=5, out_channels=n_tasks)
gcn_trainer = SimpleGNNTrainer(metrics, device="cuda", metric_best="auroc")
gcn_results = gcn_trainer.train(gcn_mdl, dataset)
```
### Customized dataset construction
#### Load network and labels
```python
from nleval import data

root = "datasets"  # save dataset and cache under the datasets/ directory

# Load processed BioGRID data from archive.
# Alternatively, set version="latest" to get and process the newest data from scratch.
g = data.BioGRID(root, version="nledata-v0.1.0-dev3")

# Load DisGeNET gene set collections.
lsc = data.DisGeNET(root, version="latest")
```
#### Setting up data and splits
```python
from nleval.util.converter import GenePropertyConverter
from nleval.label.split import RatioHoldout

# Load PubMed count gene property converter and use it to set up study-bias holdout split
pubmedcnt_converter = GenePropertyConverter(root, name="PubMedCount")
splitter = RatioHoldout(0.6, 0.4, ascending=False, property_converter=pubmedcnt_converter)
```
#### Filter labeled data based on network genes and splits
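```python
# Assumed import path; `filters` is not imported in the original snippet
from nleval.label import filters

# Apply in-place filters to the labelset collection
lsc.iapply(
    filters.Compose(
        # Only use genes that are present in the network
        filters.EntityExistenceFilter(list(g.node_ids)),
        # Remove any labelsets with less than 50 network genes
        filters.LabelsetRangeFilterSize(min_val=50),
        # Make sure each split has at least 10 positive examples
        filters.LabelsetRangeFilterSplit(min_val=10, splitter=splitter),
    ),
)
```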
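The three filtering steps described in the comments above can be pictured with a small plain-Python analogue. This is only a sketch of the logic, not nleval's filter API: the function `filter_labelsets`, its parameters, and the toy gene names are hypothetical, and the thresholds are lowered so the toy data fits on a few lines.

```python
def filter_labelsets(labelsets, network_genes, splits, min_size=50, min_per_split=10):
    """Hypothetical plain-Python analogue of the three filters; not nleval's API."""
    network_genes = set(network_genes)
    kept = {}
    for name, genes in labelsets.items():
        # 1. Only keep genes that are present in the network
        genes = set(genes) & network_genes
        # 2. Drop labelsets with too few network genes
        if len(genes) < min_size:
            continue
        # 3. Require enough positive examples in every split
        if all(len(genes & set(part)) >= min_per_split for part in splits):
            kept[name] = genes
    return kept


network_genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
splits = [{"g1", "g2", "g3", "g4"}, {"g5", "g6", "g7", "g8"}]  # e.g. train / test
labelsets = {
    "A": {"g1", "g2", "g5", "g6", "x1"},  # kept: 4 network genes, 2 per split
    "B": {"g1", "g2", "g3", "g4"},        # dropped: no positives in second split
    "C": {"g1", "g5", "x2", "x3"},        # dropped: only 2 network genes
}
kept = filter_labelsets(labelsets, network_genes, splits, min_size=4, min_per_split=2)
print(kept)
```

The order matters: the size and per-split checks are applied after restricting each labelset to network genes, so genes that cannot be used for learning never count toward the thresholds.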
#### Combine into dataset
```python
from nleval import Dataset

dataset = Dataset(graph=g, feature=g.to_dense_graph().to_feature(), label=lsc, splitter=splitter)
```
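Using the dense graph as the feature means, conceptually, that each node's feature vector is its row of the adjacency matrix. The sketch below illustrates that idea in plain Python under the assumption of an undirected, weighted graph; `dense_adjacency` and the toy edge list are hypothetical and do not reflect how nleval stores graphs internally.

```python
def dense_adjacency(node_ids, edges):
    """Build a dense, symmetric adjacency matrix (list of rows) from weighted edges."""
    index = {node: i for i, node in enumerate(node_ids)}
    mat = [[0.0] * len(node_ids) for _ in node_ids]
    for u, v, w in edges:
        mat[index[u]][index[v]] = w
        mat[index[v]][index[u]] = w  # assumption: the graph is undirected
    return mat


node_ids = ["A", "B", "C"]
edges = [("A", "B", 1.0), ("B", "C", 0.5)]
features = dense_adjacency(node_ids, edges)

# Row i is the feature vector of node i: its edge weights to every node
for node, row in zip(node_ids, features):
    print(node, row)
```

A dense matrix needs memory quadratic in the number of nodes, which is the trade-off for being able to feed the rows directly to standard models such as logistic regression.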
## Data preparation and releasing notes
First, bump the data version in `__init__.py` to the next data release version, e.g., `nledata-v0.1.0 -> nledata-v0.1.1-dev`.
Then, download and process all the latest data by running:
```bash
python script/release_data.py
```
By default, the data ready to be uploaded (e.g., to [Zenodo](https://zenodo.org)) is saved under `data_release/archived`.
After inspecting the archived data, if everything looks good, upload and publish it.
**Note:** `dev` data should be uploaded to the [sandbox](https://sandbox.zenodo.org/record/1097545#.YxYrqezMJzV) instead.
Check items:
- [ ] Update `__data_version__`
- [ ] Run [`release_data.py`](script/release_data.py)
- [ ] Upload archived data to Zenodo (be sure to edit the data version there also)
- [ ] Update the URL dict in the config (to be improved in the future by getting the info from Zenodo directly)
- [ ] Update network stats in data [test](test/test_data.py)
Finally, commit and push the bumped version.