phenopy


- Name: phenopy
- Version: 0.6.0
- Summary: Phenotype comparison scoring by semantic similarity.
- Upload time: 2023-06-17 03:54:30
- Author: Kevin Arvai
- Requires Python: >=3.9,<4.0

[![python-version](https://img.shields.io/badge/python-3.6+-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![github-actions](https://github.com/GeneDx/phenopy/workflows/Python%20package/badge.svg)](https://github.com/GeneDx/phenopy/actions)
[![codecov](https://codecov.io/gh/GeneDx/phenopy/branch/develop/graph/badge.svg)](https://codecov.io/gh/GeneDx/phenopy)
[![DOI](https://zenodo.org/badge/207335538.svg)](https://zenodo.org/badge/latestdoi/207335538)

# phenopy
`phenopy` is a Python package, developed using Python 3.9, that performs phenotype similarity scoring by semantic similarity. It is a
lightweight but highly optimized command line tool and library for efficient semantic similarity scoring of
generic entities with phenotype annotations from the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/).

![Phenotype Similarity Clustering](https://raw.githubusercontent.com/GeneDx/phenopy/develop/notebooks/output/cluster_three_diseases.png)

## Installation
Install using pip:
```bash
pip install phenopy
```

Install from GitHub:
```bash
git clone https://github.com/GeneDx/phenopy.git
cd phenopy
pipx install poetry
poetry install
```

## Command Line Usage
### score
`phenopy` is primarily used as a command line tool. An entity, as described here, is presented as a sample, gene, or
disease, but could be any concept that warrants annotation of phenotype terms.

Use `phenopy score` to perform semantic similarity scoring in the modes listed below. Write the results of any command to a file
using `--output-file=/path/to/output_file.txt`.
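
Once results are written to a file, they can be loaded for downstream analysis. The following is a minimal sketch, assuming the three whitespace-separated columns (`#query`, `entity_id`, `score`) shown in the example outputs in this section and assuming that entity IDs contain no whitespace. The helper name `parse_score_output` is hypothetical and not part of phenopy.

```python
from typing import List, Tuple


def parse_score_output(text: str) -> List[Tuple[str, str, float]]:
    """Parse phenopy-style score output into (query, entity_id, score) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the "#query ..." header, and "..." elisions.
        if not line or line.startswith("#") or line == "...":
            continue
        fields = line.split()  # split on any run of tabs or spaces
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        query, entity_id, score = fields
        rows.append((query, entity_id, float(score)))
    return rows


# Sample text copied from the example output in this README.
sample = "#query\tentity_id\tscore\n118200\t210100\t0.0\n118200\t613266\t0.0052\n"
rows = parse_score_output(sample)

# Sort so the most similar entities come first.
rows.sort(key=lambda row: row[2], reverse=True)
print(rows[0])
```

For a real run, read the file written by `--output-file` and pass its contents to the helper.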

1. Score similarity of entities defined by the HPO terms from an input file against all the OMIM diseases in
    `.phenopy/data/phenotype.hpoa`. We provide a test input file in the repo. The default summarization method is
    `--summarization-method=BMWA`, which weights each disease's phenotypes by the frequency with which a phenotype is
    seen in that disease.
    ```bash
    phenopy score tests/data/test.score.txt
    ```
    Output:
    ```
    #query	entity_id	score
    118200  210100  0.0
    118200  615779  0.0
    118200  613266  0.0052
    ...
    ```

2. Score similarity of entities defined by the HPO terms from an input file against all the OMIM diseases in
    `.phenopy/data/phenotype.hpoa` using the non-weighted summarization method. `--summarization-method=BMA`
    applies a traditional *best-match average* summarization of semantic similarity scores when comparing terms from
    record *a* with terms from record *b*.
    ```bash
    phenopy score tests/data/test.score.txt --summarization-method=BMA
    ```
    Output:
    ```
    #query	entity_id	score
    118200  210100  0.0
    118200  615779  0.0
    118200  613266  0.0052
    ...
    ```

3. Score similarity of entities defined by the HPO terms from an input file against a custom list of entities with HPO annotations, supplied with `--records-file`. Both files use the same format.
    ```bash
    phenopy score tests/data/test.score-short.txt --records-file tests/data/test.score-long.txt
    ```
    Output:
    ```
    #query  entity_id       score
    118200  118200  0.0169
    118200  300905  0.0156
    118200  601098  0.0171
    ...
    ```

4. Score pairwise similarity of entities defined by the HPO terms from an input file using `--self`.

    ```bash
    phenopy score tests/data/test.score-long.txt --threads 4 --self
    ```
    Output:
    ```
    #query  entity_id       score
    118200  118200  0.2284
    118200  118210  0.1302
    118200  118211  0.1302
    118210  118210  0.2048
    118210  118211  0.2048
    118211  118211  0.2048
    ```
5. Score age-adjusted pairwise similarity of entities defined in the input file, using the phenotype mean age and
    standard deviation defined in the `--ages_distribution_file`. Select best-match weighted average as the scoring
    summarization method with `--summarization-method BMWA`.

    ```bash
    phenopy score tests/data/test.score-short.txt --ages_distribution_file tests/data/phenotype_age.tsv --summarization-method BMWA --threads 4 --self
    ```
    Output:
    ```
    #query  entity_id       score
    118200  210100  0.0
    118200  177650  0.0127
    118200  241520  0.0
    ...
    ```

    The phenotype age file is tab-separated text with three columns: HPO ID, mean age, and standard deviation. For example:

    | hpo-id | mean | sd |
    |------------|------|-----|
    | HP:0001251 | 6.0  | 3.0 |
    | HP:0001263 | 1.0  | 1.0 |
    | HP:0001290 | 1.0  | 1.0 |
    | HP:0004322 | 10.0 | 3.0 |
    | HP:0001249 | 6.0  | 3.0 |

    If no phenotype ages file is provided, `--summarization-method=BMWA` can still be selected; it then uses default, open-access, literature-derived phenotype ages (~1,400 age-weighted phenotypes).
    ```bash
    phenopy score tests/data/test.score-short.txt --summarization-method BMWA --threads 4
    ```
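
To make the difference between the two summarization methods concrete, the following is a minimal sketch of best-match averaging over a matrix of pairwise term similarities. It is not phenopy's implementation: the function name, the symmetric averaging of both directions, and the example similarities and weights are all illustrative assumptions.

```python
from typing import Optional, Sequence


def best_match_average(
    sim: Sequence[Sequence[float]],
    row_weights: Optional[Sequence[float]] = None,
    col_weights: Optional[Sequence[float]] = None,
) -> float:
    """Summarize a pairwise similarity matrix by (weighted) best-match average.

    sim[i][j] is the similarity between term i of record a and term j of record b.
    Without weights this is a plain best-match average; with weights, each
    term's best match contributes in proportion to its weight.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    # Best match for every term of record a (rows) and record b (columns).
    row_best = [max(row) for row in sim]
    col_best = [max(sim[i][j] for i in range(n_rows)) for j in range(n_cols)]
    row_w = list(row_weights) if row_weights is not None else [1.0] * n_rows
    col_w = list(col_weights) if col_weights is not None else [1.0] * n_cols

    def weighted_mean(values, weights):
        total = sum(weights)
        return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0

    # Average the two directions so the result is symmetric in a and b.
    return 0.5 * (weighted_mean(row_best, row_w) + weighted_mean(col_best, col_w))


# Made-up similarities between two records with two terms each.
sim = [
    [0.8, 0.1],
    [0.2, 0.4],
]
print(best_match_average(sim))  # unweighted (BMA-style)
print(best_match_average(sim, row_weights=[1.0, 0.5], col_weights=[1.0, 0.5]))  # weighted
```

In the weighted call, the second term of each record counts half as much, so its weaker best match pulls the summary down less than in the unweighted call.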

#### Parameters
For a full list of command arguments use `phenopy [subcommand] --help`:
```bash
phenopy score --help
```
Output:
```
    --output_file=OUTPUT_FILE
        File path where to store the results. [default: - (stdout)]
    --records_file=RECORDS_FILE
        An entity-to-phenotype annotation file in the same format as "input_file". This file, if provided, is used to score entries in the "input_file" against entries here. [default: None]
    --annotations_file=ANNOTATIONS_FILE
        An entity-to-phenotype annotation file in the same format as "input_file". This file, if provided, is used to add information content to the network. [default: None]
    --ages_distribution_file=AGES_DISTRIBUTION_FILE
        Phenotypes age summary stats file containing phenotype HPO id, mean_age, and std. [default: None]
    --self=SELF
        Score entries in the "input_file" against itself.
    --summarization_method=SUMMARIZATION_METHOD
        The method used to summarize the HRSS matrix. Supported Values are best match average (BMA), best match weighted average (BMWA), and maximum (maximum). [default: BMWA]
    --threads=THREADS
        Number of parallel processes to use. [default: 1]
```
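
A custom file for `--ages_distribution_file` can be generated programmatically. The sketch below writes a few of the example rows from this README as tab-separated text with Python's `csv` module. The helper name `write_age_file` and the output location are hypothetical, and the sketch assumes the file carries no header row, which this README does not state explicitly.

```python
import csv
import os
import tempfile
from typing import Iterable, Tuple


def write_age_file(path: str, rows: Iterable[Tuple[str, float, float]]) -> None:
    """Write (hpo_id, mean_age, sd) rows as tab-separated text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for hpo_id, mean_age, sd in rows:
            # Basic sanity checks before writing each row.
            if not hpo_id.startswith("HP:"):
                raise ValueError(f"not an HPO id: {hpo_id!r}")
            if sd < 0:
                raise ValueError(f"negative standard deviation for {hpo_id}")
            writer.writerow([hpo_id, mean_age, sd])


# Rows taken from the example table in this README.
example_rows = [
    ("HP:0001251", 6.0, 3.0),
    ("HP:0001263", 1.0, 1.0),
    ("HP:0004322", 10.0, 3.0),
]

# Written to the system temp directory here; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "phenotype_age.tsv")
write_age_file(out_path, example_rows)
print(out_path)
```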

## Library Usage

The `phenopy` library can be used as a `Python` module, allowing more control for advanced users.

### score

**Generate the HPO network and supporting objects**:

```python
import os
from phenopy.build_hpo import generate_annotated_hpo_network
from phenopy.score import Scorer

# data directory
phenopy_data_directory = os.path.join(os.path.expanduser('~'), '.phenopy/data')

# files used in building the annotated HPO network
obo_file = os.path.join(phenopy_data_directory, 'hp.obo')
disease_to_phenotype_file = os.path.join(phenopy_data_directory, 'phenotype.hpoa')

# if you have a custom ages_distribution_file, you can set it here.
ages_distribution_file = os.path.join(phenopy_data_directory, 'xa_age_stats_oct052019.tsv')

hpo_network, alt2prim, disease_records = \
    generate_annotated_hpo_network(obo_file,
                                   disease_to_phenotype_file,
                                   ages_distribution_file=ages_distribution_file
                                   )
```

**Then, instantiate the `Scorer` class and score HPO term lists.**

```python
scorer = Scorer(hpo_network)

terms_a = ['HP:0001263', 'HP:0011839']
terms_b = ['HP:0001263', 'HP:0000252']

print(scorer.score_term_sets_basic(terms_a, terms_b))
```

Output:

```
0.11213185474495047
```
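
The same pattern extends to ranking many records against one query. In the minimal sketch below, `rank_records` is a hypothetical helper that accepts any two-argument scoring callable, so with the objects created above you could pass `scorer.score_term_sets_basic`. To keep the block runnable without phenopy installed, a simple Jaccard overlap stands in for the scorer; its values are not comparable to phenopy scores, and the record names are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ScoreFn = Callable[[Sequence[str], Sequence[str]], float]


def rank_records(
    score_fn: ScoreFn,
    query_terms: Sequence[str],
    records: Dict[str, Sequence[str]],
) -> List[Tuple[str, float]]:
    """Score the query against every record and sort by descending score."""
    scored = [(rid, score_fn(query_terms, terms)) for rid, terms in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def jaccard(terms_a: Sequence[str], terms_b: Sequence[str]) -> float:
    """Stand-in scorer: shared terms divided by all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical records; replace with your own entity-to-term mapping.
records = {
    "record_1": ["HP:0001263", "HP:0000252"],
    "record_2": ["HP:0011839"],
    "record_3": ["HP:0000252"],
}
ranking = rank_records(jaccard, ["HP:0001263", "HP:0011839"], records)
for record_id, score in ranking:
    print(record_id, round(score, 4))
```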

### miscellaneous

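The library can also prune parent phenotypes from `phenotype.hpoa` and store the pruned annotations in a new file. The example reuses `os`, `phenopy_data_directory`, `disease_to_phenotype_file`, and `hpo_network` from the previous section: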

```python
from phenopy.util import export_phenotype_hpoa_with_no_parents
# saves a new file of phenotype disease annotations with parent HPO terms removed from phenotype lists.
disease_to_phenotype_no_parents_file = os.path.join(phenopy_data_directory, 'phenotype.noparents.hpoa')
export_phenotype_hpoa_with_no_parents(disease_to_phenotype_file, disease_to_phenotype_no_parents_file, hpo_network)
```
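
Conceptually, pruning parents removes any term that is an ancestor of another term in the same list, since the more specific term already implies it. The sketch below illustrates the idea on a toy hierarchy with made-up term IDs. It is an illustration of the concept only, not the logic inside `export_phenotype_hpoa_with_no_parents`; the `parents` mapping and both helper functions are hypothetical.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every ancestor of `term` by walking the child -> parents mapping."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def prune_parents(terms: List[str], parents: Dict[str, List[str]]) -> List[str]:
    """Keep only terms that are not an ancestor of another term in the list."""
    implied: Set[str] = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return [term for term in terms if term not in implied]


# Toy hierarchy with made-up IDs: T:3 -> T:2 -> T:1 (root), and T:4 -> T:1.
toy_parents = {"T:2": ["T:1"], "T:3": ["T:2"], "T:4": ["T:1"]}
pruned = prune_parents(["T:1", "T:2", "T:3", "T:4"], toy_parents)
print(pruned)
```

Here `T:1` and `T:2` are dropped because `T:3` implies both, while `T:4` is kept because nothing more specific in the list descends from it.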


## Initial setup
phenopy is designed to run with minimal setup from the user. To run phenopy with default parameters (recommended), go straight
to [Command Line Usage](#command-line-usage).

This section provides details about where phenopy stores data resources and config files. The following occurs when
you run phenopy for the first time.
 1. phenopy creates a `.phenopy/` directory in your home folder and downloads external resources from HPO into the
  `$HOME/.phenopy/data/` directory.
 2. phenopy creates a `$HOME/.phenopy/phenopy.ini` config file where users can set variables for phenopy to use
 at runtime.

## Config
While we recommend using the default settings for most users, the config file *can be* modified: `$HOME/.phenopy/phenopy.ini`.

To run phenopy with a different version of `hp.obo`, set the path of `obo_file` in `$HOME/.phenopy/phenopy.ini`.
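
Such an edit can also be scripted. The following is a minimal sketch using Python's `configparser`, assuming `phenopy.ini` is a standard INI file. Because the section that holds `obo_file` is not given in this README, the helper searches every section for the key. The helper name `set_obo_file`, the `[hpo]` section, and the paths in the demo are hypothetical.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Optional


def set_obo_file(config_path: Path, new_obo: str) -> Optional[str]:
    """Point the `obo_file` option at `new_obo`; return the section updated, or None."""
    # interpolation=None keeps any "%" characters in paths literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)
    for section in parser.sections():
        if parser.has_option(section, "obo_file"):
            parser.set(section, "obo_file", new_obo)
            with open(config_path, "w") as handle:
                parser.write(handle)
            return section
    return None


# Demo on a throwaway file so the real config is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "phenopy.ini"
    demo.write_text("[hpo]\nobo_file = /old/hp.obo\n")  # made-up section and path
    updated_section = set_obo_file(demo, "/new/hp.obo")
    demo_text = demo.read_text()

print(updated_section)
print(demo_text)
```

Note that rewriting a file through `configparser` drops any comments it contained, so editing `phenopy.ini` by hand may be preferable when the file is annotated.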

## Contributing
We welcome contributions from the community. Please follow these steps to set up a local development environment.
```bash
pipenv install --dev
```

To run tests locally:
```bash
pipenv shell
coverage run --source=. -m unittest discover --start-directory tests/
coverage report -m
```

## References
The underlying algorithm which determines the semantic similarity for any two HPO terms is based on an implementation of HRSS, [published here](https://www.ncbi.nlm.nih.gov/pubmed/23741529).

## Citing Phenopy
Please use the following BibTeX entry to cite this software.
```
@software{arvai_phenopy_2019,
    title = {Phenopy},
    rights = {Attribution-NonCommercial-ShareAlike 4.0 International},
    url = {https://github.com/GeneDx/phenopy},
    abstract = {Phenopy is a Python package to perform phenotype similarity scoring by semantic similarity.
        Phenopy is a lightweight but highly optimized command line tool and library to efficiently perform semantic
        similarity scoring on generic entities with phenotype annotations from the Human Phenotype Ontology (HPO).},
    version = {0.3.0},
    author = {Arvai, Kevin and Borroto, Carlos and Gainullin, Vladimir and Retterer, Kyle},
    date = {2019-11-05},
    year = {2019},
    doi = {10.5281/zenodo.3529569}
}
```


            
