| Field | Value |
|-------|-------|
| Name | phenopy |
| Version | 0.6.0 |
| Summary | Phenotype comparison scoring by semantic similarity. |
| Author | Kevin Arvai |
| Requires Python | >=3.9,<4.0 |
| Upload time | 2023-06-17 03:54:30 |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |
[![python-version](https://img.shields.io/badge/python-3.6+-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![github-actions](https://github.com/GeneDx/phenopy/workflows/Python%20package/badge.svg)](https://github.com/GeneDx/phenopy/actions)
[![codecov](https://codecov.io/gh/GeneDx/phenopy/branch/develop/graph/badge.svg)](https://codecov.io/gh/GeneDx/phenopy)
[![DOI](https://zenodo.org/badge/207335538.svg)](https://zenodo.org/badge/latestdoi/207335538)
# phenopy
`phenopy` is developed with Python 3.9 and performs phenotype similarity scoring by semantic similarity. `phenopy` is a
lightweight but highly optimized command line tool and library for efficiently performing semantic similarity scoring on
generic entities with phenotype annotations from the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/).
![Phenotype Similarity Clustering](https://raw.githubusercontent.com/GeneDx/phenopy/develop/notebooks/output/cluster_three_diseases.png)
## Installation
Install using pip:
```bash
pip install phenopy
```
Install from GitHub:
```bash
git clone https://github.com/GeneDx/phenopy.git
cd phenopy
pipx install poetry
poetry install
```
## Command Line Usage
### score
`phenopy` is primarily used as a command line tool. An entity, as described here, is typically a sample, gene, or
disease, but could be any concept that warrants annotation with phenotype terms.
Use `phenopy score` to perform semantic similarity scoring in various formats. Write the results of any command to a file
using `--output-file=/path/to/output_file.txt`.
1. Score similarity of entities defined by the HPO terms from an input file against all the OMIM diseases in
`.phenopy/data/phenotype.hpoa`. We provide a test input file in the repo. The default summarization method,
`--summarization-method=BMWA`, weighs each disease's phenotypes by the frequency with which the phenotype is seen in
that particular disease.
```bash
phenopy score tests/data/test.score.txt
```
Output:
```
#query entity_id score
118200 210100 0.0
118200 615779 0.0
118200 613266 0.0052
...
```
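The three-column output is plain tab-separated text, so it is easy to post-process. As a small Python sketch, here is how you might rank the closest entities by score (the rows are inlined from the example output above; with a real run you would parse the `--output-file` instead):

```python
# Rank scored entities, highest similarity first.
# Rows are inlined from the example output above; in practice,
# parse the tab-separated --output-file instead.
rows = [
    ("118200", "210100", 0.0),
    ("118200", "615779", 0.0),
    ("118200", "613266", 0.0052),
]

top_hits = sorted(rows, key=lambda r: r[2], reverse=True)
print(top_hits[0])  # ('118200', '613266', 0.0052)
```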
2. Score similarity of entities defined by the HPO terms from an input file against all the OMIM diseases in
`.phenopy/data/phenotype.hpoa`. To use the non-weighted summarization method, pass `--summarization-method=BMA`, which
uses a traditional *best-match average* of semantic similarity scores when comparing terms from record *a*
with terms from record *b*.
```bash
phenopy score tests/data/test.score.txt --summarization-method=BMA
```
Output:
```
#query entity_id score
118200 210100 0.0
118200 615779 0.0
118200 613266 0.0052
...
```
3. Score similarity of entities defined by the HPO terms from an input file against a custom list of entities with HPO annotations, referred to as the `--records-file`. Both files are in the same format.
```bash
phenopy score tests/data/test.score-short.txt --records-file tests/data/test.score-long.txt
```
Output:
```
#query entity_id score
118200 118200 0.0169
118200 300905 0.0156
118200 601098 0.0171
...
```
4. Score pairwise similarity of entities defined by the HPO terms from an input file using `--self`.
```bash
phenopy score tests/data/test.score-long.txt --threads 4 --self
```
Output:
```
#query entity_id score
118200 118200 0.2284
118200 118210 0.1302
118200 118211 0.1302
118210 118210 0.2048
118210 118211 0.2048
118211 118211 0.2048
```
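Because `--self` emits each unordered pair once, downstream code often needs the symmetric lookup. A short Python sketch (scores inlined from the example output above; with a real run you would read the tab-separated `--output-file` instead) that mirrors each pair into a dictionary:

```python
# Build a symmetric score lookup from `--self` output.
# The raw text is inlined from the example above; in practice, read
# the tab-separated --output-file instead.
raw = """#query\tentity_id\tscore
118200\t118200\t0.2284
118200\t118210\t0.1302
118200\t118211\t0.1302
118210\t118210\t0.2048
118210\t118211\t0.2048
118211\t118211\t0.2048"""

scores = {}
for line in raw.splitlines()[1:]:  # skip the header line
    query, entity, score = line.split("\t")
    scores[(query, entity)] = scores[(entity, query)] = float(score)

print(scores[("118210", "118200")])  # 0.1302, same as (118200, 118210)
```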
5. Score age-adjusted pairwise similarity of entities defined in the input file,
using the phenotype mean age and standard deviation defined in the `--ages_distribution_file`,
and select best-match weighted average as the scoring summarization method (`--summarization-method BMWA`).
```bash
phenopy score tests/data/test.score-short.txt --ages_distribution_file tests/data/phenotype_age.tsv --summarization-method BMWA --threads 4 --self
```
Output:
```
#query entity_id score
118200 210100 0.0
118200 177650 0.0127
118200 241520 0.0
...
```
The phenotype age file contains the HPO id, mean age, and standard deviation as tab-separated text, for example:
| HPO ID | mean | sd |
|------------|------|-----|
| HP:0001251 | 6.0 | 3.0 |
| HP:0001263 | 1.0 | 1.0 |
| HP:0001290 | 1.0 | 1.0 |
| HP:0004322 | 10.0 | 3.0 |
| HP:0001249 | 6.0 | 3.0 |
If no phenotype ages file is provided, `--summarization-method=BMWA` can still be selected; it falls back to default, open-access literature-derived phenotype ages (~1,400 age-weighted phenotypes).
```bash
phenopy score tests/data/test.score-short.txt --summarization-method BMWA --threads 4
```
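To give intuition for how a mean/sd pair can become a phenotype weight: one plausible scheme (an illustration of the idea only, not phenopy's exact formula) treats the weight as the probability that a phenotype has presented by the record's age, via a normal CDF:

```python
import math

def age_weight(record_age, mean, sd):
    """Illustrative age weight: the probability that a phenotype has
    presented by `record_age`, modeled as a normal CDF with the given
    mean and standard deviation. A sketch of the idea only, not
    phenopy's exact weighting."""
    z = (record_age - mean) / (sd * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))

# HP:0001251 (mean 6.0, sd 3.0): a 6-year-old record gets weight 0.5,
# while a 12-year-old record is weighted close to 1.0.
print(round(age_weight(6.0, 6.0, 3.0), 2))   # 0.5
print(round(age_weight(12.0, 6.0, 3.0), 2))  # 0.98
```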
#### Parameters
For a full list of command arguments use `phenopy [subcommand] --help`:
```bash
phenopy score --help
```
Output:
```
--output_file=OUTPUT_FILE
File path where to store the results. [default: - (stdout)]
--records_file=RECORDS_FILE
An entity-to-phenotype annotation file in the same format as "input_file". This file, if provided, is used to score entries in the "input_file" against entries here. [default: None]
--annotations_file=ANNOTATIONS_FILE
An entity-to-phenotype annotation file in the same format as "input_file". This file, if provided, is used to add information content to the network. [default: None]
--ages_distribution_file=AGES_DISTRIBUTION_FILE
Phenotypes age summary stats file containing phenotype HPO id, mean_age, and std. [default: None]
--self=SELF
Score entries in the "input_file" against each other.
--summarization_method=SUMMARIZATION_METHOD
The method used to summarize the HRSS matrix. Supported values are best match average (BMA), best match weighted average (BMWA), and maximum (maximum). [default: BMWA]
--threads=THREADS
Number of parallel processes to use. [default: 1]
```
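For readers who want the summarization methods spelled out: given a pairwise similarity matrix between the terms of record *a* (rows) and record *b* (columns), BMA averages the best match for every term on both sides, `maximum` takes the single best cell, and BMWA additionally weights each best match (e.g. by phenotype frequency or age). A minimal sketch in plain Python (an illustration, not phenopy's internal code):

```python
def bma(matrix):
    """Best-match average: mean of each row's best match and each
    column's best match."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    best = row_best + col_best
    return sum(best) / len(best)

def maximum(matrix):
    """Maximum: the single highest pairwise similarity."""
    return max(max(row) for row in matrix)

# Two terms of record a (rows) vs. two terms of record b (columns).
sim = [[1.0, 0.2],
       [0.1, 0.6]]
print(bma(sim))      # 0.8 = (1.0 + 0.6 + 1.0 + 0.6) / 4
print(maximum(sim))  # 1.0
```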
## Library Usage
The `phenopy` library can be used as a `Python` module, allowing more control for advanced users.
### score
**Generate the HPO network and supporting objects:**
```python
import os
from phenopy.build_hpo import generate_annotated_hpo_network
from phenopy.score import Scorer
# data directory
phenopy_data_directory = os.path.join(os.getenv('HOME'), '.phenopy/data')
# files used in building the annotated HPO network
obo_file = os.path.join(phenopy_data_directory, 'hp.obo')
disease_to_phenotype_file = os.path.join(phenopy_data_directory, 'phenotype.hpoa')
# if you have a custom ages_distribution_file, you can set it here.
ages_distribution_file = os.path.join(phenopy_data_directory, 'xa_age_stats_oct052019.tsv')
hpo_network, alt2prim, disease_records = \
generate_annotated_hpo_network(obo_file,
disease_to_phenotype_file,
ages_distribution_file=ages_distribution_file
)
```
**Then, instantiate the `Scorer` class and score HPO term lists:**
```python
scorer = Scorer(hpo_network)
terms_a = ['HP:0001263', 'HP:0011839']
terms_b = ['HP:0001263', 'HP:0000252']
print(scorer.score_term_sets_basic(terms_a, terms_b))
```
Output:
```
0.11213185474495047
```
### miscellaneous
The library can also be used to prune parent phenotypes from `phenotype.hpoa` and store the pruned annotations in a new file:
```python
import os

from phenopy.util import export_phenotype_hpoa_with_no_parents

# Assumes `phenopy_data_directory`, `disease_to_phenotype_file`, and
# `hpo_network` are defined as in the score example above.
# Saves a new file of phenotype disease annotations with parent HPO terms
# removed from the phenotype lists.
disease_to_phenotype_no_parents_file = os.path.join(phenopy_data_directory, 'phenotype.noparents.hpoa')
export_phenotype_hpoa_with_no_parents(disease_to_phenotype_file, disease_to_phenotype_no_parents_file, hpo_network)
```
## Initial setup
phenopy is designed to run with minimal setup from the user. To run phenopy with default parameters (recommended), skip ahead
to [Command Line Usage](#command-line-usage).
This section provides details about where phenopy stores data resources and config files. The following occurs the first
time you run phenopy.
1. phenopy creates a `.phenopy/` directory in your home folder and downloads external resources from HPO into the
`$HOME/.phenopy/data/` directory.
2. phenopy creates a `$HOME/.phenopy/phenopy.ini` config file where users can set variables for phenopy to use
at runtime.
## Config
While we recommend that most users keep the default settings, the config file `$HOME/.phenopy/phenopy.ini` *can be* modified.
To run phenopy with a different version of `hp.obo`, set the path of `obo_file` in `$HOME/.phenopy/phenopy.ini`.
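As an illustration of the kind of override the config supports (the section and key layout below is an assumption; check the file phenopy generates on first run rather than copying this verbatim):

```ini
; $HOME/.phenopy/phenopy.ini
; Illustrative sketch only: the section and key names here are
; assumptions; verify against the file generated on first run.
[hpo]
obo_file = /home/you/ontologies/hp.obo
```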
## Contributing
We welcome contributions from the community. Please follow these steps to set up a local development environment.
```bash
pipenv install --dev
```
To run tests locally:
```bash
pipenv shell
coverage run --source=. -m unittest discover --start-directory tests/
coverage report -m
```
## References
The underlying algorithm which determines the semantic similarity for any two HPO terms is based on an implementation of HRSS, [published here](https://www.ncbi.nlm.nih.gov/pubmed/23741529).
## Citing Phenopy
Please use the following BibTeX entry to cite this software.
```
@software{arvai_phenopy_2019,
title = {Phenopy},
rights = {Attribution-NonCommercial-ShareAlike 4.0 International},
url = {https://github.com/GeneDx/phenopy},
abstract = {Phenopy is a Python package to perform phenotype similarity scoring by semantic similarity.
Phenopy is a lightweight but highly optimized command line tool and library to efficiently perform semantic
similarity scoring on generic entities with phenotype annotations from the Human Phenotype Ontology (HPO).},
version = {0.3.0},
author = {Arvai, Kevin and Borroto, Carlos and Gainullin, Vladimir and Retterer, Kyle},
date = {2019-11-05},
year = {2019},
doi = {10.5281/zenodo.3529569}
}
```