# py-graph-imputation
[![PyPi Version](https://img.shields.io/pypi/v/py-graph-imputation.svg)](https://pypi.python.org/pypi/py-graph-imputation)
* [Graph Imputation](#graph-imputation)
* [Development](#development)
* [Running A Minimal Example Imputation](#running-a-minimal-configuration-example)
### Graph Imputation
`py-graph-imputation` is the successor to [GRIMM](https://github.com/nmdp-bioinformatics/grimm), rewritten in Python and based on [NetworkX](https://networkx.org/).
![GRIM Dependencies](images/py-graph-imputation.png)
### Use `py-graph-imputation`
#### Install `py-graph-imputation` from PyPi
```
pip install py-graph-imputation
```
#### Get Frequency Data and Subject Data and Configuration File
For an example, download [example-conf-data.zip](https://github.com/nmdp-bioinformatics/py-graph-imputation/tree/master/example-conf-data.zip)
Unzip it so the directory layout looks like this:
```
conf
|-- README.md
`-- minimal-configuration.json
data
|-- freqs
|   `-- CAU.freqs.gz
`-- subjects
    `-- donor.csv
```
#### Modify `minimal-configuration.json` to suit your needs
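The configuration is plain JSON, so it can be inspected and edited programmatically with only the standard library. A minimal sketch; the key names used here (`populations`, `freq_file_directory`) are illustrative stand-ins, not the package's actual schema, so use the keys present in your copy of `minimal-configuration.json`:

```python
import json
import os
import tempfile

# A toy stand-in for conf/minimal-configuration.json; the key names
# below are invented for illustration, not grim's real schema.
conf = {"populations": ["CAU"], "freq_file_directory": "data/freqs"}

path = os.path.join(tempfile.mkdtemp(), "minimal-configuration.json")
with open(path, "w") as f:
    json.dump(conf, f, indent=2)

# Re-load the file, tweak a value, and write it back.
with open(path) as f:
    loaded = json.load(f)
loaded["populations"] = ["CAU"]
with open(path, "w") as f:
    json.dump(loaded, f, indent=2)

print(loaded["populations"])  # -> ['CAU']
```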
#### Produce HPF csv file from Frequency Data
```
>>> from graph_generation.generate_hpf import produce_hpf
>>> produce_hpf(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Conversion to HPF file based on following configuration:
Population: ['CAU']
Frequency File Directory: data/freqs
Output File: output/hpf.csv
****************************************************************************************************
Reading Frequency File: data/freqs/CAU.freqs.gz
Writing hpf File: output/hpf.csv
```
This will produce the files used for graph generation:
```
output
|-- hpf.csv             # CSV file of Haplotype, Population, Freq
`-- pop_counts_file.txt # Size of each population
```
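The `hpf.csv` rows can be read back with the standard `csv` module. A sketch under the assumption that each row is `haplotype,population,frequency` with no header line; the haplotype strings and frequencies below are invented for illustration:

```python
import csv
import io

# Toy hpf.csv content; real files come from produce_hpf(). The
# haplotypes and frequencies here are made up for illustration.
sample = """A*01:01~B*08:01~C*07:01~DQB1*02:01~DRB1*03:01,CAU,0.0521
A*02:01~B*07:02~C*07:02~DQB1*06:02~DRB1*15:01,CAU,0.0488
"""

rows = list(csv.reader(io.StringIO(sample)))
total = sum(float(freq) for _haplotype, _pop, freq in rows)
print(len(rows), round(total, 4))  # -> 2 0.1009
```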
#### Generate the Graph (nodes and edges) files
```
>>> from grim.grim import graph_freqs
>>> graph_freqs(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing graph generation based on following configuration:
Population: ['CAU']
Freq File: output/hpf.csv
Freq Trim Threshold: 1e-05
****************************************************************************************************
```
This will produce the following files:
```
output
`-- csv
|-- edges.csv
|-- info_node.csv
|-- nodes.csv
`-- top_links.csv
```
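These CSVs are consumed by `grim` internally, but they can also be inspected directly. Since the exact column layout of `edges.csv` is not documented here, this sketch assumes a simple `source,target,weight` layout and invents its own sample rows; check the real file's header before relying on this:

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in an assumed "source,target,weight" layout;
# the real edges.csv columns may differ.
sample_edges = """n1,n2,0.7
n1,n3,0.2
n2,n3,0.1
"""

# Build a weighted adjacency mapping from the edge list.
adjacency = defaultdict(dict)
for src, dst, weight in csv.reader(io.StringIO(sample_edges)):
    adjacency[src][dst] = float(weight)

print(sorted(adjacency["n1"]))  # -> ['n2', 'n3']
```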
#### Produce Imputation Results for Subjects
```
>>> from grim.grim import impute
>>> impute(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing imputation based on:
Population: ['CAU']
Priority: {'alpha': 0.4999999, 'eta': 0, 'beta': 1e-07, 'gamma': 1e-07, 'delta': 0.4999999}
UNK priority: SR
Epsilon: 0.001
Plan B: True
Number of Results: 10
Number of Population Results: 100
Nodes File: output/csv/nodes.csv
Top Links File: output/csv/edges.csv
Input File: data/subjects/donor.csv
Output UMUG Format: True
Output UMUG Freq Filename: output/don.umug
Output UMUG Pops Filename: output/don.umug.pops
Output Haplotype Format: True
Output HAP Freq Filename: output/don.pmug
Output HAP Pops Filename: output/don.pmug.pops
Output Miss Filename: output/don.miss
Output Problem Filename: output/don.problem
Factor Missing Data: 0.0001
Loci Map: {'A': 1, 'B': 2, 'C': 3, 'DQB1': 4, 'DRB1': 5}
Plan B Matrix: [[[1, 2, 3, 4, 5]], [[1, 2, 3], [4, 5]], [[1], [2, 3], [4, 5]], [[1, 2, 3], [4], [5]], [[1], [2, 3], [4], [5]], [[1], [2], [3], [4], [5]]]
Pops Count File: output/pop_counts_file.txt
Use Pops Count File: output/pop_counts_file.txt
Number of Options Threshold: 100000
Max Number of haplotypes in phase: 100
Save space mode: False
****************************************************************************************************
0 Subject: D1 8400 haplotypes
0 Subject: D1 6028 haplotypes
0.09946062499999186
```
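The `Plan B Matrix` line in the log above lists progressively finer partitions of the loci (referred to by their numbers in the `Loci Map`) that imputation falls back to when a full-haplotype match fails. A small sketch translating one of those partitions back into locus names:

```python
# Loci Map and one Plan B partition, copied from the log output above.
loci_map = {'A': 1, 'B': 2, 'C': 3, 'DQB1': 4, 'DRB1': 5}
partition = [[1, 2, 3], [4, 5]]  # the second fallback plan

# Invert the map so each locus number resolves to its name.
number_to_locus = {num: name for name, num in loci_map.items()}
named = [[number_to_locus[n] for n in block] for block in partition]
print(named)  # -> [['A', 'B', 'C'], ['DQB1', 'DRB1']]
```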
This will produce the following files in the `output` directory:
```
output
├── don.miss        # Cases that failed imputation (e.g., incorrect typing)
├── don.pmug        # Phased imputation as PMUG GL String
├── don.pmug.pops   # Populations for the phased imputation
├── don.problem     # List of errors
├── don.umug        # Unphased imputation as UMUG GL String
└── don.umug.pops   # Populations for the unphased imputation
```
### Development
How to develop on the project locally.
1. Make sure the following prerequisites are installed.
   1. `git`
   2. `python >= 3.8`
   3. build tools, e.g., `make`
2. Clone the repository locally
```shell
git clone git@github.com:nmdp-bioinformatics/py-graph-imputation.git
cd py-graph-imputation
```
3. Make a virtual environment by running `make venv`
```shell
> make venv
python3 -m venv venv --prompt py-graph-imputation-venv
=====================================================================
To activate the new virtual environment, execute the following from your shell
source venv/bin/activate
```
4. Source the virtual environment
```shell
source venv/bin/activate
```
5. Development workflow is driven through `Makefile`. Use `make` to list all targets.
```
> make
clean remove all build, test, coverage and Python artifacts
clean-build remove build artifacts
clean-pyc remove Python file artifacts
clean-test remove test and coverage artifacts
lint check style with flake8
behave run the behave tests, generate and serve report
pytest run tests quickly with the default Python
test run all(BDD and unit) tests
coverage check code coverage quickly with the default Python
dist builds source and wheel package
docker-build build a docker image for the service
docker build a docker image for the service
install install the package to the active Python's site-packages
venv creates a Python3 virtualenv environment in venv
activate activate a virtual environment. Run `make venv` before activating.
```
6. Install all the development dependencies. This will install packages from all `requirements-*.txt` files.
```shell
make install
```
7. Package Module files go in the `grim` directory.
```
grim
|-- __init__.py
|-- grim.py
`-- imputation
    |-- __init__.py
    |-- cutils.pyx
    |-- cypher_plan_b.py
    |-- cypher_query.py
    |-- impute.py
    `-- networkx_graph.py
```
8. Run all tests with `make test` or different tests with `make behave` or `make pytest`.
9. Run `make lint` to run the linter and black formatter.
10. Use `python app.py` to run the Flask service app in debug mode. The service will be available at http://localhost:8080/
11. Use `make docker-build` to build a docker image using the current `Dockerfile`.
12. `make docker` will build and run the docker image with the service. The service will be available at http://localhost:8080/
### Running a minimal configuration example
From the main directory of the repo run:
```
scripts/build-imputation-validation.sh
```
This will prepare and load frequency data into the graph and run imputation on a sample set of subjects.
The execution is driven by the configuration file: `conf/minimal-configuration.json`
It takes input from this file:
```
data/subjects/donor.csv
```
And generates an `output` directory with these contents:
```
output
├── don.miss
├── don.pmug
├── don.pmug.pops
├── don.problem
├── don.umug
└── don.umug.pops
```
The `.problem` file contains cases that failed due to serious errors (e.g., invalid HLA).
The `.miss` file contains cases where no output was possible given the input, frequencies, and configuration options.
The `.pmug` file contains the Phased Multi-locus Unambiguous Genotypes.
The `.umug` file contains the Un-phased Multi-locus Unambiguous Genotypes.
Both files are CSV with the following columns:
* id
* genotype - in GL String format
* frequency
* rank
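A sketch of reading one such result row with the `csv` module; the subject id, GL String, frequency, and rank below are invented and only illustrate splitting a row into its four fields (`^` is the locus delimiter in a GL String):

```python
import csv
import io

# Invented result row in the id,genotype,frequency,rank layout
# described above; not actual imputation output.
sample = "D1,A*01:01+A*02:01^B*07:02+B*08:01,0.0123,0\n"

(subject_id, genotype, freq, rank), = csv.reader(io.StringIO(sample))
loci = genotype.split("^")  # '^' separates loci in a GL String
print(subject_id, len(loci), float(freq), int(rank))  # -> D1 2 0.0123 0
```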
The `.pmug.pops` and `.umug.pops` contain the corresponding population assignments.
The `.pops` files are CSV with the following columns:
* id
* pop1
* pop2
* frequency
* rank
### History

#### 0.0.1 (2021-08-25)

* First release on PyPI.