[![PyPI version](https://badge.fury.io/py/nleval.svg)](https://badge.fury.io/py/nleval)
[![Documentation Status](https://readthedocs.org/projects/networklearningeval/badge/?version=latest)](https://networklearningeval.readthedocs.io/en/latest/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

[![Tests](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/tests.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/tests.yml)
[![Test Examples](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/examples.yml)
[![Test Data](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml/badge.svg)](https://github.com/krishnanlab/NetworkLearningEval/actions/workflows/test_data.yml)

# NetworkLearningEval

## Installation

Clone the repository and then install it via `pip`:

```bash
git clone https://github.com/krishnanlab/NetworkLearningEval && cd NetworkLearningEval
pip install -e .
```

The `-e` option installs the package in editable mode, so changes to the source code take effect without reinstalling the library.
If you do not plan on modifying the source code, you can omit `-e` and simply run `pip install .`.

### Optional PyTorch Geometric installation

Some GNN-related features require [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric) (PyG).
To install PyG, first install [PyTorch](https://pytorch.org); see the links above for full installation instructions.
Assuming the system has Python 3.8 or above with CUDA 10.2, use the following to install both PyTorch and PyG.

```bash
conda install pytorch=1.12.1 torchvision cudatoolkit=10.2 -c pytorch
pip install torch-geometric==2.0.4 torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.12.1+cu102.html
```

### Quick install using the installation script

```bash
source install.sh cu102  # other options are [cpu,cu113]
```

## Quick Demonstration

### Construct default datasets

We provide a high-level dataset constructor to help users effortlessly set up an ML-ready dataset
for a given combination of network and label. In particular, the dataset is set up with a study-bias
holdout split (6/2/2): the 60% most well-studied genes, as measured by the number of associated
PubMed publications, are used for training, the 20% least-studied genes are used for testing, and
the remaining 20% are used for validation. For more customizable data loading and processing options,
see the [customized dataset construction](#customized-dataset-construction) section below.

```python
from nleval.util.dataset_constructors import default_constructor

root = "datasets"  # save dataset and cache under the datasets/ directory
version = "nledata-v0.1.0-dev3"  # archive data version, use 'latest' to pull latest data from source instead

# Download and process network/label data. Use the adjacency matrix as the ML feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET",
                              graph_as_feature=True, use_dense_graph=True)
```

### Evaluating standard models

Simple machine learning methods such as logistic regression and label propagation can be evaluated
easily using the trainer objects. A trainer takes a dictionary of metrics as input for evaluating
model performance, and can be set up as follows.

```python
from nleval.metric import auroc
from nleval.model_trainer import SupervisedLearningTrainer, LabelPropagationTrainer

metrics = {"auroc": auroc}  # use AUROC as our default evaluation metric
sl_trainer = SupervisedLearningTrainer(metrics)
lp_trainer = LabelPropagationTrainer(metrics)
```

Then, use the `eval_multi_ovr` method of the trainer to evaluate a given ML model over all tasks
in a one-vs-rest setting.

```python
from sklearn.linear_model import LogisticRegression
from nleval.model.label_propagation import OneHopPropagation

# Initialize models
sl_mdl = LogisticRegression(penalty="l2", solver="lbfgs")
lp_mdl = OneHopPropagation()

# Evaluate the models over all tasks
sl_results = sl_trainer.eval_multi_ovr(sl_mdl, dataset)
lp_results = lp_trainer.eval_multi_ovr(lp_mdl, dataset)
```

### Evaluating GNN models

Training and evaluation of Graph Neural Network (GNN) models can be done in a very similar fashion.

```python
from torch_geometric.nn import GCN
from nleval.model_trainer.gnn import SimpleGNNTrainer

# Use 1-dimensional trivial node feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET")

# Train and evaluate a GCN
# (n_tasks is the number of labelsets, i.e., prediction tasks, in the dataset)
gcn_mdl = GCN(in_channels=1, hidden_channels=64, num_layers=5, out_channels=n_tasks)
gcn_trainer = SimpleGNNTrainer(metrics, device="cuda", metric_best="auroc")
gcn_results = gcn_trainer.train(gcn_mdl, dataset)
```

### Customized dataset construction

#### Load network and labels

```python
from nleval import data

root = "datasets"  # save dataset and cache under the datasets/ directory

# Load processed BioGRID data from archive.
# Alternatively, set version="latest" to get and process the newest data from scratch.
g = data.BioGRID(root, version="nledata-v0.1.0-dev3")

# Load DisGeNET gene set collections.
lsc = data.DisGeNET(root, version="latest")
```

#### Setting up data and splits

```python
from nleval.util.converter import GenePropertyConverter
from nleval.label.split import RatioHoldout

# Load PubMed count gene property converter and use it to set up study-bias holdout split
pubmedcnt_converter = GenePropertyConverter(root, name="PubMedCount")
splitter = RatioHoldout(0.6, 0.4, ascending=False, property_converter=pubmedcnt_converter)
```

#### Filter labeled data based on network genes and splits

```python
from nleval.label import filters  # labelset filters (module path assumed; adjust if it differs)

# Apply in-place filters to the labelset collection
lsc.iapply(
    filters.Compose(
        # Only use genes that are present in the network
        filters.EntityExistenceFilter(list(g.node_ids)),
        # Remove any labelsets with less than 50 network genes
        filters.LabelsetRangeFilterSize(min_val=50),
        # Make sure each split has at least 10 positive examples
        filters.LabelsetRangeFilterSplit(min_val=10, splitter=splitter),
    ),
)
```

#### Combine into dataset

```python
from nleval import Dataset
dataset = Dataset(graph=g, feature=g.to_dense_graph().to_feature(), label=lsc, splitter=splitter)
```
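With the custom dataset assembled, it can be evaluated the same way as the default one. A minimal
sketch, reusing the trainer and model objects from the
[evaluating standard models](#evaluating-standard-models) section above:

```python
# Evaluate the logistic regression and label propagation models on the custom dataset
sl_results = sl_trainer.eval_multi_ovr(sl_mdl, dataset)
lp_results = lp_trainer.eval_multi_ovr(lp_mdl, dataset)
```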

## Data preparation and releasing notes

First, bump the data version in `__init__.py` to the next data release version, e.g., `nledata-v0.1.0 -> nledata-v0.1.1-dev`.
Then, download and process all of the latest data by running

```bash
python script/release_data.py
```

By default, the data ready to be uploaded (e.g., to [Zenodo](https://zenodo.org)) is saved under `data_release/archived`.
After inspecting the archived data, if everything looks good, upload and publish it.

**Note:** `dev` data should be uploaded to the [sandbox](https://sandbox.zenodo.org/record/1097545#.YxYrqezMJzV) instead.

Check items:

- [ ] Update `__data_version__`
- [ ] Run [`release_data.py`](script/release_data.py)
- [ ] Upload archived data to Zenodo (be sure to edit the data version there also)
- [ ] Update url dict in config (will improve in the future to get info from Zenodo directly)
- [ ] Update network stats in data [test](test/test_data.py)

Finally, commit and push the bumped version.
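For example (a sketch; the path to `__init__.py` and the exact version in the commit message will vary with the project layout and release):

```bash
# Stage the bumped data version, commit, and push
git add nleval/__init__.py
git commit -m "Bump data version to nledata-v0.1.1-dev"
git push
```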

            
