graphguest


- **Name**: graphguest
- **Version**: 1.1.0
- **Summary**: Graph Universal Embedding Splitting Tool
- **Upload time**: 2023-09-08 09:46:46
- **Requires Python**: >=3
- **License**: MIT License Copyright (c) [2022] [ubioinformat] Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- **Keywords**: graph embedding, pypi, drug repurposing, splitting, graph embedding splitting
- **Requirements**: colorama, joblib, numpy, pandas, python-dateutil, pytz, scikit-learn, scipy, six, threadpoolctl, tqdm, tzdata
- **Travis-CI**: No Travis.
- **Coveralls test coverage**: No coveralls.

# Graph Universal Embedding Splitting Tool (GUEST)

## Description

This package evaluates graph embedding prediction methodologies. GraphGuest works on any heterogeneous undirected graph with exactly two node types and one edge type. It was developed in the context of drug repurposing, as part of the paper "Towards a more inductive world for drug repurposing approaches". From here on, we refer to node types 1 and 2 as Drugs and Proteins, respectively, so the evaluated graph is a Drug-Target Interaction (DTI) network.

### GraphGuest splitting functionality

GraphGuest splits any chosen network into train and test sets according to one of several criteria:
- **Random**: No constraints are imposed; DTIs are distributed across train and test randomly.
- **Sp**: Related to pairs. Any drug or protein may appear in both the train and test sets, but no interaction appears in both.
- **Sd**: Related to drug nodes. Drug nodes are not shared between the train and test sets, i.e., a drug seen during training does not appear in the test set.
- **St**: Related to targets. Protein nodes are not shared between the train and test sets, i.e., a protein seen during training does not appear in the test set.


<p align="center" width="100%">
    <img width="50%" src="https://raw.githubusercontent.com/ubioinformat/GraphGuest/aa9624ef53498a1e239d67f3a2952411187fee2e/imgs/Splitting.PNG">
</p>
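
The four criteria can be expressed as simple set checks on a train/test pair. The following is a minimal sketch, not GraphGuest's own code: the function name `check_split` and the toy identifiers are assumptions made for illustration.

```python
import pandas as pd

def check_split(train: pd.DataFrame, test: pd.DataFrame, mode: str) -> bool:
    """Return True if the train/test pair satisfies the given split mode."""
    if mode == "random":
        return True  # no constraints are imposed
    if mode == "Sp":
        # No (Drug, Protein) pair may appear in both sets.
        train_pairs = set(zip(train["Drug"], train["Protein"]))
        test_pairs = set(zip(test["Drug"], test["Protein"]))
        return train_pairs.isdisjoint(test_pairs)
    if mode == "Sd":
        # No drug may appear in both sets.
        return set(train["Drug"]).isdisjoint(test["Drug"])
    if mode == "St":
        # No protein may appear in both sets.
        return set(train["Protein"]).isdisjoint(test["Protein"])
    raise ValueError(f"unknown mode: {mode}")

# Toy data (hypothetical identifiers): D2 and P2 occur in both sets.
train = pd.DataFrame({"Drug": ["D1", "D1", "D2"], "Protein": ["P1", "P2", "P1"]})
test = pd.DataFrame({"Drug": ["D2", "D3"], "Protein": ["P2", "P3"]})

sp_ok = check_split(train, test, "Sp")  # pairs are disjoint
sd_ok = check_split(train, test, "Sd")  # D2 is shared
st_ok = check_split(train, test, "St")  # P2 is shared
```

In this toy case the split satisfies Sp but violates both Sd and St, which shows that Sp is the weakest of the three constrained modes.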

### GraphGuest subsampling functionality

DTI networks are generally highly sparse, i.e., negative interactions far outnumber positive ones. Including all negative edges is therefore not feasible
and would bias the model towards negative predictions. Accordingly, a balanced dataset is usually built by selecting all the positive interactions
and randomly subsampling the same number of negatives (negative-to-positive ratio of 1). In the presented work, we showed that random subsampling can oversimplify the
prediction task, as the model is unlikely to be evaluated on hard-to-classify negative samples. This subsampling methodology also lacks biological meaning.
Hence, we proposed weighting negative interactions by a structure-based metric (RMSD of the distance between atoms of two protein structures) to find hard-to-classify
samples and increase the accuracy and robustness of the drug repurposing model.
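
For reference, the random baseline described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not GraphGuest's internal one: the function `subsample_negatives` and the toy identifiers are hypothetical, and every drug-protein pair absent from the positive list is treated as a negative.

```python
import itertools
import random

import pandas as pd

def subsample_negatives(dtis: pd.DataFrame, ratio: int = 1, seed: int = 0) -> pd.DataFrame:
    """Randomly draw `ratio` negatives per positive from the unobserved pairs."""
    positives = set(zip(dtis["Drug"], dtis["Protein"]))
    drugs = sorted(dtis["Drug"].unique())
    proteins = sorted(dtis["Protein"].unique())
    # Every drug-protein pair that is not a known interaction is a candidate negative.
    candidates = [pair for pair in itertools.product(drugs, proteins) if pair not in positives]
    n_draw = min(len(candidates), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return pd.DataFrame(rng.sample(candidates, n_draw), columns=["Drug", "Protein"])

# Toy network: 3 positives out of 9 possible pairs.
toy = pd.DataFrame({"Drug": ["D1", "D2", "D3"], "Protein": ["P1", "P2", "P3"]})
negatives = subsample_negatives(toy, ratio=1, seed=0)
```

Because the draw is uniform over all unobserved pairs, nothing steers it towards negatives that resemble true interactions, which is the limitation discussed above.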

Along these lines, GraphGuest can use a matrix of distances/scores between every pair of proteins as an alternative to random subsampling. If this matrix is provided, then for each positive DTI
the negative DTI is formed by the same drug and the protein that maximizes (or minimizes) the distance/score with respect to the protein in the positive DTI.

<p align="center" width="100%">
    <img width="50%" src="https://raw.githubusercontent.com/ubioinformat/GraphGuest/aa9624ef53498a1e239d67f3a2952411187fee2e/imgs/RMSD.PNG">
</p>
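
The selection rule can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes the score matrix is a square pandas DataFrame indexed by protein identifier, it ignores thresholds, and the name `pick_negatives_by_score` is hypothetical.

```python
import pandas as pd

def pick_negatives_by_score(dtis: pd.DataFrame, scores: pd.DataFrame, maximize: bool = True) -> pd.DataFrame:
    """For each positive (drug, protein), pair the drug with the best-scoring other protein."""
    positives = set(zip(dtis["Drug"], dtis["Protein"]))
    rows = []
    for drug, protein in zip(dtis["Drug"], dtis["Protein"]):
        # Scores from the positive's protein to every other protein.
        row = scores.loc[protein].drop(labels=protein)
        # Skip proteins that already interact with this drug.
        allowed = [q for q in row.index if (drug, q) not in positives]
        if not allowed:
            continue
        row = row[allowed]
        partner = row.idxmax() if maximize else row.idxmin()
        rows.append((drug, partner))
    return pd.DataFrame(rows, columns=["Drug", "Protein"])

# Toy symmetric score matrix (made-up values) over three proteins.
proteins = ["P1", "P2", "P3"]
scores = pd.DataFrame(
    [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]],
    index=proteins, columns=proteins,
)
toy = pd.DataFrame({"Drug": ["D1", "D2"], "Protein": ["P1", "P2"]})
far_negatives = pick_negatives_by_score(toy, scores, maximize=True)
near_negatives = pick_negatives_by_score(toy, scores, maximize=False)
```

Unlike the random baseline, each negative here is tied to a specific positive through the protein score, so the choice between maximizing and minimizing controls how similar the negative's protein is to the positive's.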

## How to use it

The parameters of the GraphGuest GUEST class are:
- **DTIs**: Interaction list as a pandas DataFrame with the columns "Drug" and "Protein" for the type 1 and type 2 nodes.
- **mode**: The split criterion introduced above: random, Sp, Sd or St (default: Sp).
- **subsampling**: Whether to subsample negative interactions instead of using all interactions to build the dataset (default: True).
- **n_seeds**: Number of times the dataset is built, each time with a different seed and therefore a different split (default: 5).
- **negative_to_positive_ratio**: Number of negative DTIs subsampled per positive DTI (default: 1).

First, load the required libraries:

    from graphguest import GUEST
    import pandas as pd
    import pickle

Then, load the DTI dataset. It must be a pandas DataFrame containing the columns "Drug" and "Protein". An example, Yamanishi's NR network, is located in the tests folder (nr_dti.txt).

    DTIs = pd.read_csv("tests/nr_dti.txt", sep='\t', header=None) 
    DTIs.columns = ['Drug', 'Protein']
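
If the example file is not at hand, a table in the same shape can be built in memory. The sketch below assumes the input is a two-column, tab-separated list without a header row, as the `read_csv` call above suggests; the identifiers are made up.

```python
import io

import pandas as pd

# Made-up identifiers: two tab-separated columns, no header row.
raw = "D1\tP1\nD1\tP2\nD2\tP1\n"
toy_dtis = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
toy_dtis.columns = ["Drug", "Protein"]

# Basic sanity checks before handing the table to a splitter.
has_expected_columns = list(toy_dtis.columns) == ["Drug", "Protein"]
has_duplicates = bool(toy_dtis.duplicated().any())
```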

Create the GUEST object, specifying the DTI dataset, the split mode the dataset must fulfill, the subsampling option and the number of seeds.

    ggnr = GUEST(DTIs, mode = "Sp", subsampling = True, n_seeds = 5)

You can optionally provide a score matrix for the Protein column. In this example, the matrix holds the RMSD (atomic distances) between every pair of proteins in the NR dataset. Negative subsampling then shifts from random selection to a rank-based approach: for each Drug-Target Interaction (DTI), negative DTIs are selected using their rank and predefined thresholds. Here, RMSD values <= 2.5 are discarded, values > 2.5 and <= 5 are held out, and values > 5 and <= 6 are used to subsample negatives (see the RMSD figure for reference).

    ggnr.apply_rank(RMSD_threshold = (2.5, 5, 6), fpath = 'tests/rmsd_nr.pkl') #(example matrix contains random values)
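
To make the three thresholds concrete, the snippet below bins RMSD values into the ranges described above using NumPy. It only illustrates the interval boundaries and is not how `apply_rank` is implemented; the function `rmsd_bins` and the label names are assumptions, and the source does not say how values above the last threshold are treated, so they are simply labeled `above_range` here.

```python
import numpy as np

THRESHOLDS = (2.5, 5, 6)
LABELS = np.array(["discarded", "held_out", "subsampled", "above_range"])

def rmsd_bins(values, thresholds=THRESHOLDS):
    """Label each RMSD value by the interval it falls into (upper bounds inclusive)."""
    # right=True gives intervals of the form (lower, upper], matching "<=" in the text.
    idx = np.digitize(values, bins=np.asarray(thresholds, dtype=float), right=True)
    return LABELS[idx]

example = np.array([1.0, 2.5, 3.0, 5.0, 5.5, 6.0, 7.5])
binned = rmsd_bins(example)
```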

Now, generate the splits according to the specified options. GraphGuest can generate n folds that fulfill the
split criteria in a cross-validation fashion, or follow a Train/Val/Test configuration.
 
    ggnr.generate_splits_cv(foldnum=10) #(Cross-Validation)
    ggnr.generate_splits_tvt() #(Train-Validation-Test)
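
As a point of comparison, drug-disjoint (Sd-style) cross-validation folds can be produced with scikit-learn's `GroupKFold` by grouping rows on the Drug column. This is a sketch of the general idea and makes no claim about how `generate_splits_cv` works internally; the toy data are made up.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy network with three drugs, two interactions each.
toy = pd.DataFrame({
    "Drug": ["D1", "D1", "D2", "D2", "D3", "D3"],
    "Protein": ["P1", "P2", "P1", "P3", "P2", "P3"],
})

# Grouping by drug keeps all interactions of a drug in the same fold.
gkf = GroupKFold(n_splits=3)
folds = []
for train_idx, test_idx in gkf.split(toy, groups=toy["Drug"]):
    folds.append((toy.iloc[train_idx], toy.iloc[test_idx]))

# Each fold should have no drug shared between train and test.
drug_disjoint = [
    set(train["Drug"]).isdisjoint(test["Drug"]) for train, test in folds
]
```

Grouping on the Protein column instead would give the St-style counterpart.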

Finally, retrieve the results. If the RMSD option has been applied, the held-out fold is also returned (see the RMSD figure). A node
embedding dictionary can also be passed as an argument to generate the node embedding datasets according to the generated split distribution.

    #load the node embedding dictionary (the example file holds randomly generated arrays of shape (2,) and (3,) for drugs and proteins, respectively)
    with open('tests/node_emb_nr.pkl', 'rb') as handle:
        node_emb = pickle.load(handle)
    
    #retrieve the results; the calls below are alternatives, pick the one matching your options
    #(if node_emb is not passed, then seed_cv_ne won't be returned)
    seed_cv_list = ggnr.retrieve_results() #(Default)
    seed_cv_list, seed_cv_ne = ggnr.retrieve_results(node_emb) #(Default with node_emb dictionary)

    seed_cv_list, final_fold = ggnr.retrieve_results() #(RMSD applied)
    seed_cv_list, final_fold, seed_cv_ne = ggnr.retrieve_results(node_emb) #(RMSD applied with node_emb dictionary)
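
One common way to turn a split into a node embedding dataset is to concatenate the drug and protein vectors of each pair. The sketch below assumes the dictionary maps node identifiers to NumPy arrays and that concatenation is the desired combination; neither assumption is confirmed by the package, and `pairs_to_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical embeddings: shape (2,) for drugs and (3,) for proteins.
toy_emb = {
    "D1": np.array([0.1, 0.2]),
    "D2": np.array([0.3, 0.4]),
    "P1": np.array([1.0, 2.0, 3.0]),
    "P2": np.array([4.0, 5.0, 6.0]),
}

def pairs_to_features(dtis: pd.DataFrame, emb: dict) -> np.ndarray:
    """Concatenate the drug and protein embeddings of each pair into one row."""
    return np.stack([
        np.concatenate([emb[drug], emb[protein]])
        for drug, protein in zip(dtis["Drug"], dtis["Protein"])
    ])

toy_split = pd.DataFrame({"Drug": ["D1", "D2"], "Protein": ["P2", "P1"]})
features = pairs_to_features(toy_split, toy_emb)  # one row of length 2 + 3 per pair
```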

You can verify that your splits fulfill the mode requirements after they have been generated. Note that
if a rank matrix is applied instead of random subsampling, the split mode is ignored because of
inconsistencies between the rank and the split constraints.

    ggnr.test_splits() 
    
    #verbose can be set to True if more information is desired. (verbose's default: False)
    ggnr.test_splits(verbose=True) 

    #You can also visualize the final distribution of DTIs by setting distr to True (distr's default: False)
    ggnr.test_splits(distr=True)
    

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "graphguest",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3",
    "maintainer_email": "",
    "keywords": "graph embedding,pypi,drug repurposing,splitting,graph embedding splitting",
    "author": "",
    "author_email": "Jes\u00fas de la Fuente Cede\u00f1o <jdlfuentec@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/e8/b0/625477b5b7d30c78fff0ff52fa8277d4ed3d9af28a606799a7dca437c4e6/graphguest-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) [2022] [ubioinformat]  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Graph Universal Embedding Splitting Tool",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/ubioinformat/guest"
    },
    "split_keywords": [
        "graph embedding",
        "pypi",
        "drug repurposing",
        "splitting",
        "graph embedding splitting"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "21579a7ae4cb9da994b3c7aa2d8c51968223b66d094b18de78c4e1921cc704f4",
                "md5": "3d35e7fa8f3c46e4d2a9b98eb8989773",
                "sha256": "c7d131b4117f406e65282a97967fd7456fc3f29ee17f0a6eb39a18b21320ed6b"
            },
            "downloads": -1,
            "filename": "graphguest-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3d35e7fa8f3c46e4d2a9b98eb8989773",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3",
            "size": 17465,
            "upload_time": "2023-09-08T09:46:44",
            "upload_time_iso_8601": "2023-09-08T09:46:44.994659Z",
            "url": "https://files.pythonhosted.org/packages/21/57/9a7ae4cb9da994b3c7aa2d8c51968223b66d094b18de78c4e1921cc704f4/graphguest-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e8b0625477b5b7d30c78fff0ff52fa8277d4ed3d9af28a606799a7dca437c4e6",
                "md5": "915942c3589a9221a3549ef8ed2ec857",
                "sha256": "02a7c050caf8b08a88352ed0e1b0f302eee70e3123938918c6581bbd38fbc894"
            },
            "downloads": -1,
            "filename": "graphguest-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "915942c3589a9221a3549ef8ed2ec857",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3",
            "size": 18278,
            "upload_time": "2023-09-08T09:46:46",
            "upload_time_iso_8601": "2023-09-08T09:46:46.403537Z",
            "url": "https://files.pythonhosted.org/packages/e8/b0/625477b5b7d30c78fff0ff52fa8277d4ed3d9af28a606799a7dca437c4e6/graphguest-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-08 09:46:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ubioinformat",
    "github_project": "guest",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "colorama",
            "specs": [
                [
                    "==",
                    "0.4.6"
                ]
            ]
        },
        {
            "name": "joblib",
            "specs": [
                [
                    "==",
                    "1.3.2"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.25.2"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.0.3"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    "==",
                    "2.8.2"
                ]
            ]
        },
        {
            "name": "pytz",
            "specs": [
                [
                    "==",
                    "2023.3"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.11.2"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    "==",
                    "1.16.0"
                ]
            ]
        },
        {
            "name": "threadpoolctl",
            "specs": [
                [
                    "==",
                    "3.2.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.66.1"
                ]
            ]
        },
        {
            "name": "tzdata",
            "specs": [
                [
                    "==",
                    "2023.3"
                ]
            ]
        }
    ],
    "lcname": "graphguest"
}
        