py-graph-match


* Name: py-graph-match
* Version: 0.0.8
* Home page: https://github.com/nmdp-bioinformatics/
* Summary: Graph Match
* Upload time: 2024-01-29 17:00:42
* Author: Pradeep Bashyal
* Requires Python: >=3.8
* License: LGPL 3.0
* Keywords: graph, hla
* Requirements: No requirements were recorded.

py-graph-match
===================

Matching with Graph

`grma` is a package for finding HLA matches using a graph-based approach.
The matching is based on [grim's](https://github.com/nmdp-bioinformatics/py-graph-imputation) imputation.


## Pre-requisites

### Data Directory Structure

```
data
├── donors_dir
│   └── donors.txt
├── hpf.csv
└── patients.txt
```

### conf Directory Structure

```
conf
└── minimal-configuration.json
```
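
Before running the steps below, it can help to confirm that the expected input files are in place. The following is a minimal sketch using only the standard library. The helper name `missing_inputs` and the `REQUIRED_PATHS` list are illustrative assumptions derived from the two directory listings above; they are not part of `grma`.

```python
from pathlib import Path

# Paths taken from the directory listings above; adjust to your own layout.
REQUIRED_PATHS = [
    "data/donors_dir/donors.txt",
    "data/hpf.csv",
    "data/patients.txt",
    "conf/minimal-configuration.json",
]


def missing_inputs(root="."):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All expected input files are present.")
```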

Follow these steps for finding matches:

Set up a virtual environment (venv) and run:
```
make install
```

## Quick Getting Started

Get Started with a built-in example.

### Build 'Donors Graph'

```
python test_build_donors_graph.py
```

### Find Matches

Use the grma algorithm to find matches efficiently by running the file `test_matching.py`:
```
python test_matching.py
```

Find the match results in the `results` directory.

# Full Walkthrough
### Building The Donors' Graph:

The donors' graph contains all the donors (the search space). It is implemented using a LOL (List of Lists) representation written in Cython for better time and memory efficiency.
Building it can take a lot of memory and time, so it is recommended to save the graph in a pickle file.

Before building the donors' graph, all the donors' HLAs must be imputed using `grim`.
Then all the imputation files must be saved under the same directory.

```python
import os
from grma.donorsgraph.build_donors_graph import BuildMatchingGraph

PATH_TO_DONORS_DIR = "data/donors_dir"
PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"

os.makedirs("output", exist_ok=True)  # make sure the output directory exists

build_matching = BuildMatchingGraph(PATH_TO_DONORS_DIR)
graph = build_matching.graph  # access the donors' graph

build_matching.to_pickle(PATH_TO_DONORS_GRAPH)  # save the donors' graph to pickle
```
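
Because building is expensive, a common pattern is to build the graph once and reload the pickle on later runs. A minimal sketch of that pattern follows. `load_or_build` is a hypothetical helper, and the `build` and `load` callables are passed in so that the sketch assumes nothing about `grma` beyond the calls shown in this walk-through.

```python
import os


def load_or_build(pickle_path, build, load):
    """Reuse a cached pickle if present, otherwise build and save it first.

    `build(pickle_path)` must create the file; `load(pickle_path)` must read it.
    """
    if not os.path.exists(pickle_path):
        # Create the parent directory if the path has one.
        os.makedirs(os.path.dirname(pickle_path) or ".", exist_ok=True)
        build(pickle_path)
    return load(pickle_path)


# With grma, the two callables could wrap the calls from this walk-through, e.g.:
#   build = lambda path: BuildMatchingGraph("data/donors_dir").to_pickle(path)
#   load = Graph.from_pickle   # from grma.match import Graph
```

Passing the callables in keeps the caching logic independent of how the graph is built, so it can be exercised without any donor data.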

### Search & Match before imputation to patients
The function `matching` finds matches with up to 3 mismatches and returns, for each patient, a `pandas.DataFrame` of the matches sorted by number of mismatches and then by score.

The function takes these parameters:
* match_graph: a grma donors' graph object - `grma.match.Graph`
* grim_config_file: a path to `grim` configuration file


```python
from grma.match import Graph, matching

PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
PATH_CONFIG_FILE = "conf/minimal-configuration.json"


# The donors' graph we built earlier
donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)


# matching_results is a dict - {patient_id: the patient's result dataframe}
matching_results = matching(donors_graph, PATH_CONFIG_FILE, search_id=1, donors_info=[],
                            threshold=0.1, cutoff=100, save_to_csv=True, output_dir="results")

```

`matching` takes some optional parameters, which you might want to change:

* search_id: An integer identifying the search. Default is 0.
* donors_info: An iterable of fields from the database to include in the results. Default is None.
* threshold: Minimal score value for a valid match. Default is 0.1.
* cutoff: Maximum number of matches to return. Default is 50.
* verbose: A boolean flag for whether to print the documentation. Default is False.
* save_to_csv: A boolean flag for whether to save the matching results to a CSV file. Default is False. If set to True, the function generates a directory named `search_1` upon completion.
* `output_dir`: Output directory to write the match results file to.
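
To make the roles of `threshold` and `cutoff` concrete, here is a toy model in plain Python of the behaviour described above: drop candidates below the score threshold, sort by number of mismatches and then by score, and keep at most `cutoff` rows. This only illustrates the documented semantics and is not grma's implementation. The function `rank_matches`, the field names, and the assumption that a higher score ranks first are all hypothetical.

```python
def rank_matches(candidates, threshold=0.1, cutoff=50, max_mismatches=3):
    """Toy filter-and-sort over dicts with 'donor', 'mismatches' and 'score' keys."""
    valid = [
        c for c in candidates
        if c["score"] >= threshold and c["mismatches"] <= max_mismatches
    ]
    # Fewer mismatches first; within a mismatch count, higher score first (assumed).
    valid.sort(key=lambda c: (c["mismatches"], -c["score"]))
    return valid[:cutoff]


candidates = [
    {"donor": 7, "mismatches": 1, "score": 0.40},
    {"donor": 3, "mismatches": 0, "score": 0.25},
    {"donor": 9, "mismatches": 0, "score": 0.05},  # below threshold, dropped
    {"donor": 5, "mismatches": 0, "score": 0.90},
]
print([c["donor"] for c in rank_matches(candidates, cutoff=2)])  # [5, 3]
```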

### Search & Match after imputation to patients

The function `find_matches` finds matches with up to 3 mismatches and returns, for each patient, a `pandas.DataFrame` of the matches sorted by number of mismatches and then by score.

It takes these parameters:
* imputation_filename: a path to the file of the patients' typing.
* match_graph: a grma donors' graph object - `grma.match.Graph`

```python
from grma.match import Graph, find_matches

PATH_TO_PATIENTS_FILE = "data/patients_file.txt"
PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"

# The donors' graph we built earlier
donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
matching_results = find_matches(PATH_TO_PATIENTS_FILE, donors_graph)

# matching_results is a dict - {patient_id: the patient's result dataframe}

for patient, df in matching_results.items():
    # Use here the dataframe 'df' with the results for 'patient'
    print(patient, df)
```

`find_matches` takes some optional parameters, which you might want to change:
* search_id: An integer identifying the search. Default is 0.
* donors_info: An iterable of fields from the database to include in the results. Default is None.
* threshold: Minimal score value for a valid match. Default is 0.1.
* cutoff: Maximum number of matches to return. Default is 50.
* verbose: A boolean flag for whether to print the documentation. Default is False.
* save_to_csv: A boolean flag for whether to save the matching results to a CSV file. Default is False.
  If set to True, the function generates a directory named `Matching_Results_1` upon completion.
* calculate_time: A boolean flag for whether to return the matching time for each patient. Default is False.
  With `calculate_time=True` the output is a dict of the form `{patient_id: (results_dataframe, time)}`.
* `output_dir`: Output directory to write the match results file to.
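
When `calculate_time=True`, each value in the returned dict is a `(results_dataframe, time)` tuple rather than a bare dataframe. The sketch below shows one way to separate the two. `split_timed_results` is a hypothetical helper that relies only on the dict shape described above, and the sample values are stand-ins.

```python
def split_timed_results(timed_results):
    """Split {patient_id: (results, time)} into two dicts keyed by patient_id."""
    results = {pid: pair[0] for pid, pair in timed_results.items()}
    timings = {pid: pair[1] for pid, pair in timed_results.items()}
    return results, timings


# Stand-in values; with grma the first tuple element would be a pandas.DataFrame.
timed = {"P1": ("df for P1", 0.8), "P2": ("df for P2", 1.7)}
results, timings = split_timed_results(timed)

# Find the patient whose search took longest.
slowest = max(timings, key=timings.get)
print(slowest, timings[slowest])  # P2 1.7
```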

### Set Database
To include more donor information in the matching results than the matching information alone,
you can set a database that holds all the donors' information.
The database must be a `pandas.DataFrame` whose index contains the donors' IDs.

After setting the database, when calling one of the matching functions,
you may pass in the `donors_info` parameter a `list` with the names of the database columns you want to join to the result dataframe.

Example of setting the database:

```python
import pandas as pd
from grma.match import set_database

donors = [0, 1, 2]
database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=donors)

set_database(database)
```
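
The effect of requesting extra columns can be pictured as a join on donor ID. The sketch below uses plain pandas with a made-up results frame. The column names `ID` and `Score` are assumptions for illustration, and the join is a model of the described behaviour rather than grma's internal code.

```python
import pandas as pd

# Donor database indexed by donor ID, as required above.
database = pd.DataFrame(
    {"Age": [30, 32, 25], "Sex": ["F", "M", "F"]}, index=[0, 1, 2]
)

# Made-up match results for one patient (hypothetical columns).
results = pd.DataFrame({"ID": [2, 0], "Score": [0.9, 0.4]})

# Requesting only "Age" corresponds to joining that column by donor ID.
donors_info = ["Age"]
enriched = results.join(database[donors_info], on="ID")
print(enriched)
```

Only the columns named in `donors_info` are carried over, so the `Sex` column stays out of the enriched frame.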


# How to contribute:

1. Fork the repository: https://github.com/nmdp-bioinformatics/py-graph-match.git
2. Clone the repository locally
    ```shell
    git clone  https://github.com/<Your-Github-ID>/py-graph-match.git
    cd py-graph-match
    ```
3. Make a virtual environment by running `make venv`
   ```shell
    > make venv
      python3 -m venv venv --prompt urban-potato-venv
      =====================================================================
    To activate the new virtual environment, execute the following from your shell
    source venv/bin/activate
   ```
4. Source the virtual environment
   ```shell
   source venv/bin/activate
   ```
5. The development workflow is driven through the `Makefile`. Use `make` to list all targets.
   ```
    > make
    clean                remove all build, test, coverage and Python artifacts
    clean-build          remove build artifacts
    clean-pyc            remove Python file artifacts
    clean-test           remove test and coverage artifacts
    lint                 check style with flake8
    behave               run the behave tests, generate and serve report
    pytest               run tests quickly with the default Python
    test                 run all(BDD and unit) tests
    coverage             check code coverage quickly with the default Python
    dist                 builds source and wheel package
    docker-build         build a docker image for the service
    docker               build a docker image for the service
    install              install the package to the active Python's site-packages
    venv                 creates a Python3 virtualenv environment in venv
    activate             activate a virtual environment. Run `make venv` before activating.
   ```
6. Install all the development dependencies. This installs packages from all `requirements-*.txt` files.
   ```shell
    make install
   ```
7. The Gherkin Feature files, step files and pytest files go in `tests` directory:
    ```
    tests
    |-- features
    |   |-- algorithm
    |   |   `-- SLUG\ Match.feature
    |   `-- definition
    |       `-- Class\ I\ HLA\ Alleles.feature
    |-- steps
    |   |-- HLA_alleles.py
    |   `-- SLUG_match.py
    `-- unit
        `-- test_py-graph-match.py
    ```
8. Package Module files go in the `py-graph-match` directory.
    ```
    py-graph-match
    |-- __init__.py
    |-- algorithm
    |   `-- match.py
    |-- model
    |   |-- allele.py
    |   `-- slug.py
    `-- py-graph-match.py
    ```
9. Run all tests with `make test` or different tests with `make behave` or `make pytest`. `make behave` will generate report files and open the browser to the report.
10. Use `python app.py` to run the Flask service app in debug mode. Service will be available at http://localhost:8080/
11. Use `make docker-build` to build a docker image using the current `Dockerfile`.
12. `make docker` will build and run the docker image with the service.  Service will be available at http://localhost:8080/


=======
History
=======

0.0.1 (2021-08-25)
------------------

* First release on PyPI.

            
