PyUMLS-Similarity


* Name: PyUMLS-Similarity
* Version: 0.0.12
* Summary: This package computes a variety of similarity metrics between concepts present in the UMLS database
* Upload time: 2023-12-12 23:00:49
* Requires Python: >=3.10
* Keywords: nlp, semantic similarity, medicine, umls, lobster thermidor

## Overview

This package computes a variety of semantic similarity metrics between concepts present in the UMLS (Unified Medical Language System) database. It is a Python wrapper around the Perl modules [UMLS Interface](https://metacpan.org/dist/UMLS-Interface) and [UMLS Similarity](https://metacpan.org/dist/UMLS-Similarity), developed by Dr. Bridget McInnes and Dr. Ted Pedersen, and offers an accessible, user-friendly interface for Python users.

Check out the documentation here: https://pyumls-similarity.readthedocs.io/en/latest/

## Available Similarity Measures

    * The basic path measure --> path
    * The undirected path measure --> upath
    * Leacock and Chodorow (1998) --> lch
    * Wu and Palmer (1994) --> wup
    * Zhong, et al. (2002) --> zhong
    * Rada, et al. (1989) --> cdist
    * Nguyen and Al-Mubaid (2006) --> nam
    * Resnik (1995) --> res
    * Lin (1998) --> lin
    * Jiang and Conrath (1997) --> jcn
    * The vector measure --> vector
    * Pekar and Staab (2002) --> pks
    * Pirro and Euzenat (2010) --> faith
    * Maedche and Staab (2001) --> cmatch
    * Batet, et al. (2011) --> batet
    * Sanchez, et al. (2012) --> sanchez
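
To give a feel for what the path-based measures in this list compute, here is a minimal sketch of the commonly cited formulas behind `path`, `lch`, and `wup`. The function names and the toy depths are assumptions made for illustration; the Perl modules may differ in details such as how depth is counted.

```python
import math


def path_similarity(path_length: int) -> float:
    """path: reciprocal of the number of nodes on the shortest path."""
    return 1.0 / path_length


def lch_similarity(path_length: int, taxonomy_depth: int) -> float:
    """Leacock and Chodorow: -log(path_length / (2 * taxonomy_depth))."""
    return -math.log(path_length / (2.0 * taxonomy_depth))


def wup_similarity(depth_lcs: int, depth_1: int, depth_2: int) -> float:
    """Wu and Palmer: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2.0 * depth_lcs / (depth_1 + depth_2)


# Toy values (assumed): two concepts at depth 4 whose closest shared ancestor
# sits at depth 2, joined by a 5-node path, in a taxonomy of maximum depth 6.
print(path_similarity(5))              # 0.2
print(wup_similarity(2, 4, 4))         # 0.5
print(round(lch_similarity(5, 6), 3))  # 0.875
```

Identical concepts give the highest score under each formula, because the path shrinks to a single node and the shared ancestor is the concept itself.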

## Installation

To install PyUMLS_Similarity, run the following command:

```
pip install PyUMLS-Similarity
```

## Prerequisites

Before using the PyUMLS_Similarity package, ensure that you have the following prerequisites installed and set up:

### Strawberry Perl

The package requires Strawberry Perl to run Perl scripts. Download and install it from [Strawberry Perl's official website](http://strawberryperl.com/).

### MySQL

A local MySQL database instance is required to store and access UMLS data. Download and install MySQL from [MySQL's official download page](https://dev.mysql.com/downloads/mysql/). This package was tested on MySQL 8.1.0.

In order to work efficiently with the UMLS, you'll want to configure MySQL. A good starting point is to use the parameters designated by the UMLS found [here](https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_ORF_MySQL_Output_Stream.html).

### UMLS Data

You need to have a local instance of the UMLS installed in MySQL. This involves downloading UMLS data and importing it into your MySQL database. Follow the guidelines provided by the UMLS for [obtaining a license](https://www.nlm.nih.gov/research/umls/index.html) and [downloading the UMLS data](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html).

### UMLS-Interface and UMLS-Similarity Perl Modules

The package depends on the UMLS-Interface and UMLS-Similarity Perl modules. If you want to use feature-based semantic similarity metrics, you'll also need to download [WordNet](https://wordnet.princeton.edu/download/old-versions) and the associated Perl modules. After installing Strawberry Perl, install these modules with `cpanm`:

```
cpanm UMLS::Interface --force
cpanm UMLS::Similarity --force
cpanm WordNet::QueryData
cpanm WordNet::Similarity
```
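
Before moving on, it can help to confirm that Perl can actually load the modules you just installed. The sketch below is one way to do that from Python, assuming `perl` is on your `PATH`; the helper name `perl_module_available` is hypothetical and not part of this package.

```python
import shutil
import subprocess


def perl_module_available(module: str, perl: str = "perl") -> bool:
    """Return True if `perl -M<module> -e 1` exits without error."""
    perl_path = shutil.which(perl)
    if perl_path is None:
        return False  # Perl itself was not found on PATH
    result = subprocess.run(
        [perl_path, f"-M{module}", "-e", "1"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


for name in ("UMLS::Interface", "UMLS::Similarity"):
    status = "found" if perl_module_available(name) else "missing"
    print(f"{name}: {status}")
```

A "missing" result means either that Perl is not on your `PATH` or that the module failed to load; running the same `perl -M... -e 1` command in a terminal shows the underlying error message.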

## Usage

**IMPORTANT**: The first time you run a path-based semantic similarity calculation, the UMLS Interface builds an index of your UMLS instance inside MySQL so that path calculations in subsequent runs are efficient. This can take a long time, depending on your hardware and your MySQL configuration. The default source vocabulary (SAB) is the Medical Subject Headings (MSH) in the UMLS Metathesaurus; indexing it was relatively fast on my machine (a few minutes). You can include other SABs, such as SNOMED, LOINC, or CPT, in your UMLS Interface configuration. Be warned, however, that this can greatly increase both the memory your process requires and the time indexing takes. For example, indexing SNOMED took about 2 days.


Below are some examples of how to use the PyUMLS_Similarity package.

Start by initiating an instance of the PyUMLS_Similarity class:

```python 
from PyUMLS_Similarity import PyUMLS_Similarity

# MySQL connection details for the database that holds your local UMLS instance
mysql_info = {
    "username": "root",
    "password": "your_password",
    "hostname": "localhost",
    "socket": "MYSQL",
    "database": "umls"
}

umls_sim = PyUMLS_Similarity(mysql_info=mysql_info)

```
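
If you would rather not hard-code the database password in a script, one option is to read the connection details from environment variables. This is a minimal sketch; the helper `mysql_info_from_env` and the `UMLS_MYSQL_*` variable names are assumptions for illustration, not something the package defines. The dictionary keys and default values mirror the example above.

```python
import os


def mysql_info_from_env(prefix: str = "UMLS_MYSQL_") -> dict:
    """Build the mysql_info dictionary from environment variables.

    Only the password is mandatory; the other keys fall back to the
    values used in the README example.
    """
    password = os.environ.get(prefix + "PASSWORD")
    if password is None:
        raise RuntimeError(f"Set {prefix}PASSWORD before connecting")
    return {
        "username": os.environ.get(prefix + "USERNAME", "root"),
        "password": password,
        "hostname": os.environ.get(prefix + "HOSTNAME", "localhost"),
        "socket": os.environ.get(prefix + "SOCKET", "MYSQL"),
        "database": os.environ.get(prefix + "DATABASE", "umls"),
    }
```

The result can then be passed in the same way as before: `PyUMLS_Similarity(mysql_info=mysql_info_from_env())`.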

### Computing Multiple Similarity Metrics

You can compute similarity metrics between UMLS concepts as shown below. 

You can either provide a list of tuples containing the CUIs to be compared:

```python 
cui_pairs = [
    ('C0018563', 'C0037303'),
    ('C0035078', 'C0035078'),
]
```
Or you can provide a list of tuples containing the medical terms you want to compare:

```python 
cui_pairs = [
    ('hand', 'skull'),
    ('Renal failure', 'Kidney failure'),
]
```
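
Because both input styles share the same list-of-tuples shape, it can be useful to check which one you are about to send. UMLS CUIs are the letter `C` followed by seven digits, so a small regular expression is enough. The helpers below are a hypothetical sketch and say nothing about how the package itself tells the two apart.

```python
import re

# A CUI is "C" followed by exactly seven digits, e.g. C0018563.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_cui(value: str) -> bool:
    """True if the whole string has the shape of a UMLS CUI."""
    return CUI_PATTERN.fullmatch(value) is not None


def pairs_are_cuis(pairs) -> bool:
    """True if every element of every pair looks like a CUI."""
    return all(is_cui(a) and is_cui(b) for a, b in pairs)


print(pairs_are_cuis([("C0018563", "C0037303")]))  # True
print(pairs_are_cuis([("hand", "skull")]))          # False
```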

Then compute the similarity using specific measures:

```python 
measures = ['lch', 'wup']
similarity_df = umls_sim.similarity(cui_pairs, measures)

```

An example output would look something like this:
|    | Term 1        | Term 2        | CUI 1    | CUI 2    | lch   | wup   |
|----|---------------|---------------|----------|----------|-------|-------|
| 0  | hand          | skull         | C0018563 | C0037303 | 0.500 | 0.700 |
| 1  | Renal failure | Kidney failure| C0035078 | C0035078 | 1.000 | 1.000 |


### Finding Shortest Path

To find the shortest path between concepts:

```python 
shortest_path_df = umls_sim.find_shortest_path(cui_pairs)
```

An example output would look something like this:
|    | Term 1        | Term 2        | CUI 1    | CUI 2    | Path Length   | Path                                              |
|----|---------------|---------------|----------|----------|---------------|---------------------------------------------------|
| 0  | hand          | skull         | C0018563 | C0037303 |  9            | C0018563 => C1140618 => C0015385 => C0005898 =... |
| 1  | Renal failure | Kidney failure| C0035078 | C0035078 |  1            | C0035078 |
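
The `Path` column is a single string of CUIs joined by `=>`, and in the sample output the path length matches the number of CUIs on the path (a concept compared with itself has length 1). A small, hypothetical helper can split the full cell value into a list. Note that the table above shows a truncated display, so the second example below uses only the first three CUIs for illustration.

```python
def parse_path(path: str) -> list[str]:
    """Split a 'CUI => CUI => ...' string into a list of CUIs."""
    return [cui.strip() for cui in path.split("=>") if cui.strip()]


same_concept = parse_path("C0035078")
print(same_concept, len(same_concept))  # ['C0035078'] 1

# Illustrative string only: the first three CUIs of the truncated path above.
partial = parse_path("C0018563 => C1140618 => C0015385")
print(len(partial))  # 3
```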

### Finding Least Common Subsumer

To find the least common subsumer (LCS) of concepts:

```python 
lcs_df = umls_sim.find_least_common_subsumer(cui_pairs)
```

An example output would look something like this:
|    | Term 1        | Term 2        | CUI 1    | CUI 2    | LCS                                | Min Depth | Max Depth |
|----|---------------|---------------|----------|----------|------------------------------------|-----------|-----------|
| 0  | hand          | skull         | C0018563 | C0037303 | Anatomy (MeSH Category) (C0002807) | 5         |      5    |
| 1  | Renal failure | Kidney failure| C0035078 | C0035078 | Renal failure (C0035078)           | 1         |      1    |

### Concurrency

PyUMLS_Similarity also supports running tasks concurrently. Each call to the Perl module opens a new connection to the database, and this connection overhead is the most time-consuming part of a call; running functions sequentially or separately pays it again each time. To save time, I've made it possible to run multiple functions concurrently via Python's threading module, so the overhead of the calls overlaps instead of adding up.

```python 
tasks = [
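    {'function': 'similarity', 'arguments': (cui_pairs, measures)},
    # Note the trailing comma: (cui_pairs) is just cui_pairs, while (cui_pairs,) is a one-element tuple
    {'function': 'shortest_path', 'arguments': (cui_pairs,)},
    {'function': 'lcs', 'arguments': (cui_pairs,)}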
]

results = umls_sim.run_concurrently(tasks)
```
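
To illustrate the idea of overlapping the per-call overhead, here is a minimal sketch of dispatching a task list in threads. It is not the package's implementation: it uses `concurrent.futures.ThreadPoolExecutor` from the standard library rather than the `threading` module directly, and the `registry` of stand-in functions is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks_concurrently(tasks, registry):
    """Run each task's function in its own thread; return results by name.

    Results are keyed by function name, so this sketch assumes each
    function appears at most once in the task list.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {
            task["function"]: pool.submit(
                registry[task["function"]], *task["arguments"]
            )
            for task in tasks
        }
        return {name: future.result() for name, future in futures.items()}


# Stand-in functions (hypothetical); the real methods call out to Perl and MySQL.
def fake_similarity(pairs, measures):
    return [(a, b, m) for a, b in pairs for m in measures]


def fake_shortest_path(pairs):
    return [f"{a} => {b}" for a, b in pairs]


registry = {"similarity": fake_similarity, "shortest_path": fake_shortest_path}
pairs = [("C0018563", "C0037303")]
tasks = [
    {"function": "similarity", "arguments": (pairs, ["lch", "wup"])},
    {"function": "shortest_path", "arguments": (pairs,)},
]
results = run_tasks_concurrently(tasks, registry)
print(results["shortest_path"])  # ['C0018563 => C0037303']
```

Threads suit this kind of workload because the time goes into waiting on an external process and a database connection rather than into Python computation.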

## Acknowledgements

This package is based on the Perl modules developed by Dr. Bridget McInnes and Dr. Ted Pedersen. The package umls-similarity by Donghua Chen also served as inspiration for this package.

## Future Developments
Future developments of this package will 

* allow for calculations of standard similarity metrics like cosine similarity, the Sørensen-Dice index, Jaccard similarity, and others
* allow for modifications of the UMLS Interface Configuration file
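
For reference, the standard metrics named in the first item can be sketched in a few lines of plain Python. This is only an illustration over whitespace-separated tokens, with function names chosen here; it says nothing about how the package will eventually implement them.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sorensen_dice(a: set, b: set) -> float:
    """Twice the intersection size divided by the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


tokens_1 = "renal failure".split()
tokens_2 = "kidney failure".split()
print(jaccard(set(tokens_1), set(tokens_2)))
print(sorensen_dice(set(tokens_1), set(tokens_2)))  # 0.5
print(round(cosine(Counter(tokens_1), Counter(tokens_2)), 3))
```

Unlike the taxonomy-based measures, these compare surface tokens only, so "renal" and "kidney" count as unrelated words.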

            

## Project Metadata

* Author and maintainer: Victor Murcia (victor.murciaruiz@va.gov)
* Homepage and repository: https://github.com/victormurcia/PyUMLS_Similarity
* Issues: https://github.com/victormurcia/PyUMLS_Similarity/issues