py-graph-imputation


* Name: py-graph-imputation
* Version: 0.1.0
* Home page: https://github.com/nmdp-bioinformatics/py-grim
* Summary: Graph Based Imputation
* Upload time: 2024-02-23 21:04:49
* Author: Pradeep Bashyal
* Requires Python: >=3.8
* License: LGPL 3.0
* Keywords: grim
* Requirements: cython (==0.29.32), numpy (>=1.20.2), pandas, tqdm

# py-graph-imputation
[![PyPi Version](https://img.shields.io/pypi/v/py-graph-imputation.svg)](https://pypi.python.org/pypi/py-graph-imputation)

* [Graph Imputation](#graph-imputation)
* [Development](#development)
* [Running A Minimal Example Imputation](#running-a-minimal-configuration-example)

### Graph Imputation

`py-graph-imputation` is the successor of [GRIMM](https://github.com/nmdp-bioinformatics/grimm), written in Python and based on [NetworkX](https://networkx.org/).

![GRIM Dependencies](images/py-graph-imputation.png)

### Use `py-graph-imputation`

#### Install `py-graph-imputation` from PyPI
```
pip install py-graph-imputation
```
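
To confirm that the install succeeded, one option is to query the installed distribution metadata from Python. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for this example and is not part of the package.

```python
from importlib import metadata


def installed_version(dist_name="py-graph-imputation"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string after a successful install, otherwise None.
print(installed_version())
```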

#### Get Frequency Data, Subject Data, and Configuration File

For an example, download [example-conf-data.zip](https://github.com/nmdp-bioinformatics/py-graph-imputation/tree/master/example-conf-data.zip)

Unzip the folder so it appears as:

```
conf
|-- README.md
`-- minimal-configuration.json
data
|-- freqs
|   `-- CAU.freqs.gz
`-- subjects
    `-- donor.csv
```
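
Before running the steps below, it can help to check that the unzipped files landed where the examples expect them. The following sketch checks the three input paths from the listing above relative to a chosen root; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Relative paths taken from the directory listing above.
EXPECTED_FILES = [
    "conf/minimal-configuration.json",
    "data/freqs/CAU.freqs.gz",
    "data/subjects/donor.csv",
]


def missing_files(root="."):
    """Return the expected example files that are not present under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


missing = missing_files(".")
print("All example files found." if not missing else f"Missing: {missing}")
```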

#### Modify `conf/minimal-configuration.json` to suit your needs
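
The option names are defined by the configuration file itself and are not listed here, so a cautious way to edit it is to load the JSON, inspect its keys, and only override keys that already exist. The sketch below assumes the file is a single JSON object; the helper names and the key `example_option` are hypothetical and used for illustration only.

```python
import json


def load_config(path):
    """Read a JSON configuration file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_config(config, path):
    """Write the configuration back as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
        fh.write("\n")


def with_overrides(config, **overrides):
    """Return a copy of config with existing top-level keys replaced.

    Unknown keys raise KeyError so that a typo does not silently add an
    option that the tool would ignore.
    """
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"Keys not present in configuration: {unknown}")
    updated = dict(config)
    updated.update(overrides)
    return updated


# In-memory demonstration with a made-up key (not a real option name).
demo = {"example_option": 1}
print(with_overrides(demo, example_option=2))
```

With the real file, the same helpers would be used as `cfg = load_config('conf/minimal-configuration.json')`, followed by `print(sorted(cfg))` to see which options are available before changing any of them.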


#### Produce HPF CSV File from Frequency Data

```
>>> from graph_generation.generate_hpf import produce_hpf
>>> produce_hpf(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Conversion to HPF file based on following configuration:
	Population: ['CAU']
	Frequency File Directory: data/freqs
	Output File: output/hpf.csv
****************************************************************************************************
Reading Frequency File:	 data/freqs/CAU.freqs.gz
Writing hpf File:	 output/hpf.csv
```

This will produce the files which will be used for graph generation:

```
output
|-- hpf.csv                         # CSV file of Haplotype, Population, Freq
`-- pop_counts_file.txt             # Size of each population
```
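
As a quick sanity check on `hpf.csv`, the frequencies can be summed per population. The sketch below assumes only the column order given above (haplotype, population, frequency) and skips any row whose third field is not a number, which also covers a header row if one is present. The sample rows, including the header names, are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def total_freq_by_population(lines):
    """Sum the frequency column per population from HPF-style CSV rows."""
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # blank or malformed line
        try:
            freq = float(row[2])
        except ValueError:
            continue  # non-numeric frequency, e.g. a header row
        totals[row[1]] += freq
    return dict(totals)


# Made-up sample data; real haplotypes cover more loci.
sample = io.StringIO(
    "hap,pop,freq\n"
    "A*01:01~B*08:01,CAU,0.25\n"
    "A*02:01~B*07:02,CAU,0.5\n"
)
print(total_freq_by_population(sample))  # {'CAU': 0.75}
```

For the real file, pass an open file handle instead of the `StringIO`, for example `open('output/hpf.csv', newline='')`.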

#### Generate the Graph (nodes and edges) files

```
>>> from grim.grim import graph_freqs

>>> graph_freqs(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing graph generation based on following configuration:
	Population: ['CAU']
	Freq File: output/hpf.csv
	Freq Trim Threshold: 1e-05
****************************************************************************************************
```

This will produce the following files:

```
output
`-- csv
    |-- edges.csv
    |-- info_node.csv
    |-- nodes.csv
    `-- top_links.csv
```

#### Produce Imputation Results for Subjects

```
>>> from grim.grim import impute
>>> impute(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing imputation based on:
	Population: ['CAU']
	Priority: {'alpha': 0.4999999, 'eta': 0, 'beta': 1e-07, 'gamma': 1e-07, 'delta': 0.4999999}
	UNK priority: SR
	Epsilon: 0.001
	Plan B: True
	Number of Results: 10
	Number of Population Results: 100
	Nodes File: output/csv/nodes.csv
	Top Links File: output/csv/edges.csv
	Input File: data/subjects/donor.csv
	Output UMUG Format: True
	Output UMUG Freq Filename: output/don.umug
	Output UMUG Pops Filename: output/don.umug.pops
	Output Haplotype Format: True
	Output HAP Freq Filename: output/don.pmug
	Output HAP Pops Filename: output/don.pmug.pops
	Output Miss Filename: output/don.miss
	Output Problem Filename: output/don.problem
	Factor Missing Data: 0.0001
	Loci Map: {'A': 1, 'B': 2, 'C': 3, 'DQB1': 4, 'DRB1': 5}
	Plan B Matrix: [[[1, 2, 3, 4, 5]], [[1, 2, 3], [4, 5]], [[1], [2, 3], [4, 5]], [[1, 2, 3], [4], [5]], [[1], [2, 3], [4], [5]], [[1], [2], [3], [4], [5]]]
	Pops Count File: output/pop_counts_file.txt
	Use Pops Count File: output/pop_counts_file.txt
	Number of Options Threshold: 100000
	Max Number of haplotypes in phase: 100
	Save space mode: False
****************************************************************************************************
0 Subject: D1 8400 haplotypes
0 Subject: D1 6028 haplotypes
0.09946062499999186
```
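
The `Plan B Matrix` in the log above is written in terms of the numeric indices from `Loci Map`. Assuming each row of the matrix describes one way of splitting the loci into blocks (an interpretation based on the printed values, not on the package source), the indices can be translated back to locus names to make the rows easier to read. The helper name `name_blocks` is hypothetical.

```python
# Values copied from the imputation log above.
LOCI_MAP = {'A': 1, 'B': 2, 'C': 3, 'DQB1': 4, 'DRB1': 5}
PLAN_B_MATRIX = [
    [[1, 2, 3, 4, 5]],
    [[1, 2, 3], [4, 5]],
    [[1], [2, 3], [4, 5]],
    [[1, 2, 3], [4], [5]],
    [[1], [2, 3], [4], [5]],
    [[1], [2], [3], [4], [5]],
]


def name_blocks(matrix, loci_map):
    """Replace locus indices with locus names, keeping the nesting intact."""
    index_to_locus = {index: locus for locus, index in loci_map.items()}
    return [
        [[index_to_locus[i] for i in block] for block in row]
        for row in matrix
    ]


# Print one line per row, with blocks separated by " | ".
for row in name_blocks(PLAN_B_MATRIX, LOCI_MAP):
    print(" | ".join("+".join(block) for block in row))
```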

This will produce the following files in the `output` directory:

```
├── output
│ ├── don.miss                # Cases that failed imputation (e.g. incorrect typing etc.)
│ ├── don.pmug                # Phased imputation as PMUG GL String
│ ├── don.pmug.pops           # Population for Phased Imputation
│ ├── don.problem             # List of errors
│ ├── don.umug                # Unphased imputation as UMUG GL String
│ └── don.umug.pops           # Population for Unphased Imputation
```





### Development
How to develop on the project locally.

1. Make sure the following prerequisites are installed.
   1. `git`
   2. `python >= 3.8`
   3. build tools, e.g. `make`
2. Clone the repository locally
    ```shell
    git clone git@github.com:nmdp-bioinformatics/py-graph-imputation.git
    cd py-graph-imputation
    ```
3. Create a virtual environment by running `make venv`
   ```shell
    > make venv
      python3 -m venv venv --prompt py-graph-imputation-venv
      =====================================================================
    To activate the new virtual environment, execute the following from your shell
    source venv/bin/activate
   ```
4. Source the virtual environment
   ```shell
   source venv/bin/activate
   ```
5. The development workflow is driven through the `Makefile`. Run `make` to list all targets.
   ```
    > make
    clean                remove all build, test, coverage and Python artifacts
    clean-build          remove build artifacts
    clean-pyc            remove Python file artifacts
    clean-test           remove test and coverage artifacts
    lint                 check style with flake8
    behave               run the behave tests, generate and serve report
    pytest               run tests quickly with the default Python
    test                 run all(BDD and unit) tests
    coverage             check code coverage quickly with the default Python
    dist                 builds source and wheel package
    docker-build         build a docker image for the service
    docker               build a docker image for the service
    install              install the package to the active Python's site-packages
    venv                 creates a Python3 virtualenv environment in venv
    activate             activate a virtual environment. Run `make venv` before activating.
   ```
6. Install all the development dependencies. This installs packages from all `requirements-*.txt` files.
   ```shell
    make install
   ```
7. Package module files go in the `grim` directory.
    ```
    grim
    |-- __init__.py
    |-- grim.py
    `-- imputation
        |-- __init__.py
        |-- cutils.pyx
        |-- cypher_plan_b.py
        |-- cypher_query.py
        |-- impute.py
        `-- networkx_graph.py
    ```
8. Run all tests with `make test`, or a single suite with `make behave` or `make pytest`.
9. Run `make lint` to run the linter and black formatter.
10. Use `python app.py` to run the Flask service app in debug mode. Service will be available at http://localhost:8080/
11. Use `make docker-build` to build a docker image using the current `Dockerfile`.
12. `make docker` will build and run the docker image with the service.  Service will be available at http://localhost:8080/


### Running a minimal configuration example

From the main directory of the repo run:
```
scripts/build-imputation-validation.sh
```

This will prepare and load frequency data into the graph and run imputation on a sample set of subjects.

The execution is driven by the configuration file: `conf/minimal-configuration.json`

It takes input from this file:
```
data/subjects/donor.csv
```

And generates an `output` directory with these contents:

```
output
├── don.miss
├── don.pmug
├── don.pmug.pops
├── don.problem
├── don.umug
└── don.umug.pops
```

The `.problem` file contains cases that failed due to serious errors (e.g., invalid HLA).

The `.miss` file contains cases where there was no output possible given the input, frequencies and configuration options.

The `.pmug` file contains the Phased Multi-locus Unambiguous Genotypes.

The `.umug` file contains the Un-phased Multi-locus Unambiguous Genotypes.


Both files are CSV with the following columns:

* id
* genotype - in glstring format
* frequency
* rank
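
The four columns above can be read with the standard `csv` module. This is a minimal sketch that assumes the files have no header row and exactly four comma-separated fields per line; the type and function names are hypothetical, and the sample rows are made up rather than real imputation output.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class ImputationRow(NamedTuple):
    subject_id: str
    genotype: str  # GL string
    frequency: float
    rank: int


def read_imputation_rows(lines):
    """Parse id, genotype, frequency, rank rows; skip lines of another shape."""
    rows = []
    for record in csv.reader(lines):
        if len(record) != 4:
            continue
        subject_id, genotype, frequency, rank = record
        rows.append(ImputationRow(subject_id, genotype, float(frequency), int(rank)))
    return rows


def group_by_subject(rows):
    """Group rows by subject id, each group ordered by ascending rank."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.subject_id].append(row)
    return {sid: sorted(group, key=lambda r: r.rank) for sid, group in grouped.items()}


# Made-up sample rows for illustration only.
sample = io.StringIO(
    "D1,A*02:01+A*01:01^B*07:02+B*08:01,0.25,2\n"
    "D1,A*01:01+A*02:01^B*08:01+B*07:02,0.5,1\n"
)
for subject, rows in group_by_subject(read_imputation_rows(sample)).items():
    print(subject, [(r.rank, r.frequency) for r in rows])
```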


The `.pmug.pops` and `.umug.pops` files contain the corresponding population assignments.

The format of the `.pops` files is (csv):

* id
* pop1
* pop2
* frequency
* rank
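
One way to summarize a `.pops` file is to add up the frequency for each population pair per subject. The sketch below assumes no header row and treats `(pop1, pop2)` as an unordered pair, which is an assumption made for this example rather than a documented property of the format. The function name and sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Column order as listed above.
POPS_COLUMNS = ["id", "pop1", "pop2", "frequency", "rank"]


def pop_pair_totals(lines):
    """Sum frequency per (subject id, unordered population pair)."""
    totals = defaultdict(float)
    for record in csv.DictReader(lines, fieldnames=POPS_COLUMNS):
        if record["rank"] is None:
            continue  # fewer than five fields on this line
        pair = tuple(sorted((record["pop1"], record["pop2"])))
        totals[(record["id"], pair)] += float(record["frequency"])
    return dict(totals)


# Made-up sample rows for illustration only.
sample = io.StringIO(
    "D1,CAU,CAU,0.75,1\n"
    "D1,CAU,CAU,0.25,2\n"
)
print(pop_pair_totals(sample))  # {('D1', ('CAU', 'CAU')): 1.0}
```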


=======
History
=======

0.0.1 (2021-08-25)
------------------

* First release on PyPI.

            
