py-graph-match
===================
Matching with Graph
`grma` is a package for finding HLA matches using a graph-based approach.
The matching is based on [grim's](https://github.com/nmdp-bioinformatics/py-graph-imputation) imputation.
## Pre-requisites
### Data Directory Structure
```
data
├── donors_dir
│   └── donors.txt
├── hpf.csv
└── patients.txt
```
### conf Directory Structure
```
conf
└── minimal-configuration.json
```
Follow these steps to find matches:
Set up a virtual environment (venv) and run:
```
make install
```
## Quick Getting Started
Get started with the built-in example.
### Build 'Donors Graph'
```
python test_build_donors_graph.py
```
### Find Matches
Use the grma algorithm to find matches efficiently by running `test_matching.py`:
```
python test_matching.py
```
Find the match results in the `results` directory.
# Full Walkthrough
### Building The Donors' Graph:
The donors' graph contains all the donors (the search space). It is implemented using a LOL (List of Lists) representation written in Cython for better time and memory efficiency.
Building it can take a lot of time and memory, so it's recommended to save the graph to a pickle file.
Before building the donors' graph, all the donors' HLAs must be imputed using `grim`.
Then all the imputation files must be saved under the same directory.
```python
import os
from grma.donorsgraph.build_donors_graph import BuildMatchingGraph
PATH_TO_DONORS_DIR = "data/donors_dir"
PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
os.makedirs("output", exist_ok=True)
build_matching = BuildMatchingGraph(PATH_TO_DONORS_DIR)
graph = build_matching.graph # access the donors' graph
build_matching.to_pickle(PATH_TO_DONORS_GRAPH) # save the donors' graph to pickle
```
### Search & Match before imputation to patients
The function `matching` finds matches with up to 3 mismatches and returns a `pandas.DataFrame` of the matches, sorted by number of mismatches and score.
It takes these required parameters:
* `match_graph`: a grma donors' graph object - `grma.match.Graph`
* `grim_config_file`: a path to a `grim` configuration file
```python
from grma.match import Graph, matching
PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
PATH_CONFIG_FILE = "conf/minimal-configuration.json"
# The donors' graph we built earlier
donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
# matching_results is a dict - {patient_id: the patient's result dataframe}
matching_results = matching(donors_graph, PATH_CONFIG_FILE, search_id=1, donors_info=[],
                            threshold=0.1, cutoff=100, save_to_csv=True, output_dir="results")
```
`matching` takes some optional parameters, which you might want to change:
* `search_id`: an integer identifier for the search. Default is 0.
* `donors_info`: an iterable of fields from the database to include in the results. Default is None.
* `threshold`: minimal score value for a valid match. Default is 0.1.
* `cutoff`: maximum number of matches to return. Default is 50.
* `verbose`: a boolean flag for printing verbose output. Default is False.
* `save_to_csv`: a boolean flag for saving the matching results to a CSV file. Default is False. When set to True, the function generates a directory named `search_1` on completion.
* `output_dir`: the output directory to write the match results file to.
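As a sketch of how the returned dictionary might be post-processed, the snippet below filters and re-sorts each patient's results with plain pandas. The result shape and the column names `Donor_ID`, `Mismatches`, and `Score` are illustrative assumptions, not the documented output schema.

```python
import pandas as pd

# Hypothetical result shape: {patient_id: DataFrame}, as returned by `matching`.
# Column names here ("Donor_ID", "Mismatches", "Score") are assumptions.
matching_results = {
    1: pd.DataFrame({"Donor_ID": [10, 11, 12],
                     "Mismatches": [0, 1, 1],
                     "Score": [0.95, 0.80, 0.40]}),
}

# Keep only matches above a score threshold, best matches first
filtered = {
    pid: df[df["Score"] >= 0.5].sort_values(["Mismatches", "Score"],
                                            ascending=[True, False])
    for pid, df in matching_results.items()
}

print(filtered[1]["Donor_ID"].tolist())  # donors 10 and 11 pass the 0.5 threshold
```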
### Search & Match after imputation to patients
The function `find_matches` finds matches with up to 3 mismatches and returns a `pandas.DataFrame` of the matches,
sorted by number of mismatches and score.
It takes these required parameters:
* `imputation_filename`: a path to the patients' typing file.
* `match_graph`: a grma donors' graph object - `grma.match.Graph`
```python
from grma.match import Graph, find_matches
PATH_TO_PATIENTS_FILE = "data/patients_file.txt"
PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
# The donors' graph we built earlier
donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
matching_results = find_matches(PATH_TO_PATIENTS_FILE, donors_graph)
# matching_results is a dict - {patient_id: the patient's result dataframe}
for patient, df in matching_results.items():
# Use here the dataframe 'df' with the results for 'patient'
print(patient, df)
```
`find_matches` takes some optional parameters, which you might want to change:
* `search_id`: an integer identifier for the search. Default is 0.
* `donors_info`: an iterable of fields from the database to include in the results. Default is None.
* `threshold`: minimal score value for a valid match. Default is 0.1.
* `cutoff`: maximum number of matches to return. Default is 50.
* `verbose`: a boolean flag for printing verbose output. Default is False.
* `save_to_csv`: a boolean flag for saving the matching results to a CSV file. Default is False. When set to True, the function generates a directory named `Matching_Results_1` on completion.
* `calculate_time`: a boolean flag for returning the matching time per patient. Default is False. When `calculate_time=True`, the output is a dict like `{patient_id: (results_dataframe, time)}`.
* `output_dir`: the output directory to write the match results file to.
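The `{patient_id: (results_dataframe, time)}` shape returned with `calculate_time=True` can be unpacked as below. The data here is fabricated purely to illustrate the tuple structure; the DataFrame columns are assumptions.

```python
import pandas as pd

# Illustrative output shape for `find_matches(..., calculate_time=True)`:
# each value is a (DataFrame, elapsed_seconds) tuple. All values are made up.
matching_results = {
    42: (pd.DataFrame({"Donor_ID": [7, 8]}), 0.031),
}

for patient, (df, elapsed) in matching_results.items():
    print(f"patient {patient}: {len(df)} matches in {elapsed:.3f}s")
```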
### Set Database
To include more information about the donors in the results than the matching data alone,
you can set a database that holds all the donors' information.
The database must be a `pandas.DataFrame` whose index is the donors' IDs.
After setting the database, when calling one of the matching functions,
you may pass in `donors_info` a `list` of the column names you want joined from the database into the result dataframe.
Example of setting the database:
```python
import pandas as pd
from grma.match import set_database
donors = [0, 1, 2]
database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=donors)
set_database(database)
```
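Conceptually, passing `donors_info=["Age"]` after `set_database` amounts to an index-based join of the donor columns onto each result frame. The sketch below shows that equivalence with plain pandas; the `Score` column and the result frame are hypothetical stand-ins for a real matching result.

```python
import pandas as pd

# The donor database from the example above, indexed by donor ID
database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=[0, 1, 2])

# A hypothetical per-patient result frame, also indexed by donor ID
results = pd.DataFrame({"Score": [0.9, 0.7]}, index=[2, 0])

# donors_info=["Age"] is roughly equivalent to this index-based join:
enriched = results.join(database[["Age"]])
print(enriched)  # each row gains the matching donor's Age
```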
# How to contribute:
1. Fork the repository: https://github.com/nmdp-bioinformatics/py-graph-match.git
2. Clone the repository locally
```shell
git clone https://github.com/<Your-Github-ID>/py-graph-match.git
cd py-graph-match
```
3. Make a virtual environment and activate it, run `make venv`
```shell
> make venv
python3 -m venv venv --prompt urban-potato-venv
=====================================================================
To activate the new virtual environment, execute the following from your shell
source venv/bin/activate
```
4. Source the virtual environment
```shell
source venv/bin/activate
```
5. Development workflow is driven through `Makefile`. Use `make` to list all targets.
```
> make
clean            remove all build, test, coverage and Python artifacts
clean-build      remove build artifacts
clean-pyc        remove Python file artifacts
clean-test       remove test and coverage artifacts
lint             check style with flake8
behave           run the behave tests, generate and serve report
pytest           run tests quickly with the default Python
test             run all (BDD and unit) tests
coverage         check code coverage quickly with the default Python
dist             builds source and wheel package
docker-build     build a docker image for the service
docker           build a docker image for the service
install          install the package to the active Python's site-packages
venv             creates a Python3 virtualenv environment in venv
activate         activate a virtual environment. Run `make venv` before activating.
```
6. Install all the development dependencies. This installs packages from all `requirements-*.txt` files.
```shell
make install
```
7. The Gherkin Feature files, step files and pytest files go in `tests` directory:
```
tests
|-- features
| |-- algorithm
| | `-- SLUG\ Match.feature
| `-- definition
| `-- Class\ I\ HLA\ Alleles.feature
|-- steps
| |-- HLA_alleles.py
| `-- SLUG_match.py
`-- unit
`-- test_py-graph-match.py
```
8. Package Module files go in the `py-graph-match` directory.
```
py-graph-match
|-- __init__.py
|-- algorithm
| `-- match.py
|-- model
| |-- allele.py
| `-- slug.py
`-- py-graph-match.py
```
9. Run all tests with `make test` or different tests with `make behave` or `make pytest`. `make behave` will generate report files and open the browser to the report.
10. Use `python app.py` to run the Flask service app in debug mode. Service will be available at http://localhost:8080/
11. Use `make docker-build` to build a docker image using the current `Dockerfile`.
12. `make docker` will build and run the docker image with the service. Service will be available at http://localhost:8080/
=======
History
=======
0.0.1 (2021-08-25)
------------------
* First release on PyPI.