| Field | Value |
| ----- | ----- |
| Name | data-dictionary-cui-mapping |
| Version | 1.1.6 |
| Home page | https://github.com/kevon217/data-dictionary-cui-mapping |
| Summary | This package allows you to load in a data dictionary and map cuis to defined fields using either the UMLS API or MetaMap API from NLM, or a Semantic Search pipeline using Pinecone vector database. |
| Upload time | 2023-05-31 13:23:16 |
| Author | Kevin Armengol |
| Requires Python | >=3.8.1,<4.0.0 |
| License | MIT |
| Keywords | brics, curation, data dictionary, umls, metamap, metathesaurus, cui, concept unique identifier, nlm, pubmedbert, pritamdeka, pritamdeka/pubmedbert-mnli-snli-scinli-scitail-mednli-stsb, semantic search, pinecone, embeddings, vector database |

# data-dictionary-cui-mapping

This package assists with mapping a user's data dictionary fields to [UMLS](https://www.nlm.nih.gov/research/umls/index.html) concepts. It is designed to be modular and flexible to allow for different configurations and use cases.

Roughly, the high-level steps are as follows:
- Configure YAML files
- Load in data dictionary
- Preprocess desired columns
- Query for UMLS concepts using any or all of the following pipeline modules:
  - **umls** (*UMLS API*)
  - **metamap** (*MetaMap API*)
  - **semantic_search** (*relies on access to a custom Pinecone vector database*)
  - **hydra_search** (*combines any combination of the above three modules*)
- Manually curate/select concepts in Excel
- Create data dictionary file with new UMLS concept fields

## Prerequisites

- To use the UMLS API or the MetaMap API, you need an account with the corresponding service. You can sign up here: https://www.nlm.nih.gov/research/umls/index.html
- To use Semantic Search with Pinecone, you need a Pinecone account. You can sign up here: https://www.pinecone.io/. Please reach out to me if you would like temporary access to my Pinecone index to explore these embeddings.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install [data-dictionary-cui-mapping](https://pypi.org/project/data-dictionary-cui-mapping/) from PyPI, or install it directly from the [GitHub repo](https://github.com/kevon217/data-dictionary-cui-mapping). The project uses [poetry](https://python-poetry.org/) for packaging and dependency management.

```bash
pip install data-dictionary-cui-mapping
# Alternatively, install directly from GitHub:
# pip install git+https://github.com/kevon217/data-dictionary-cui-mapping.git
```

## Input: Data Dictionary

Below is a sample data dictionary format (*.csv*) that can be used as input for this package:

| variable name | title                  | permissible value descriptions |
| ------------- | ---------------------- |--------------------------------|
| AgeYrs        | Age in years           |                                |
| CaseContrlInd | Case control indicator | Case;Control;Unknown           |
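
As a rough illustration of this input format, the sketch below parses the sample rows with Python's standard `csv` module and splits the semicolon-delimited permissible value descriptions. The inline CSV text and the `split_values` helper are assumptions made for this example; the package has its own loading and preprocessing steps.

```python
import csv
import io

# Inline copy of the sample data dictionary above (assumed CSV layout).
SAMPLE_CSV = """variable name,title,permissible value descriptions
AgeYrs,Age in years,
CaseContrlInd,Case control indicator,Case;Control;Unknown
"""


def split_values(cell):
    """Split a semicolon-delimited cell into a list, dropping empty entries."""
    return [part.strip() for part in cell.split(";") if part.strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    values = split_values(row["permissible value descriptions"])
    print(row["variable name"], "|", row["title"], "|", values)
```

Fields without permissible values, such as `AgeYrs`, simply yield an empty list.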

## Configuration Files

To run and customize these pipelines, you will need to create or edit the YAML configuration files located in `ddcuimap/configs`. Run configurations are saved and can be reloaded.

```text
├───ddcuimap
│   ├───configs
│   │   │   config.yaml
│   │   │   __init__.py
│   │   │
│   │   ├───apis
│   │   │       __init__.py
│   │   │       config_metamap_api.yaml
│   │   │       config_pinecone_api.yaml
│   │   │       config_umls_api.yaml
│   │   │
│   │   ├───custom
│   │   │       de.yaml
│   │   │       hydra_base.yaml
│   │   │       pvd.yaml
│   │   │       title_def.yaml
│   │   │
│   │   ├───semantic_search
│   │   │       embeddings.yaml
```
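
The `group=option` override strings passed to `helper.compose_config` later in this README look like Hydra-style config group selections, where `custom=de` would pick `configs/custom/de.yaml`. The sketch below only illustrates that naming convention. `override_to_path` is a hypothetical helper, not part of `ddcuimap`, and the mapping is an assumption based on the directory tree above.

```python
from pathlib import PurePosixPath

# Assumed root of the config tree shown above.
CONFIG_ROOT = PurePosixPath("ddcuimap/configs")


def override_to_path(override):
    """Map a 'group=option' override to the YAML file it would select (hypothetical helper)."""
    group, sep, option = override.partition("=")
    if not sep or not group or not option:
        raise ValueError(f"expected 'group=option', got {override!r}")
    return CONFIG_ROOT / group / f"{option}.yaml"


for item in ["custom=title_def", "semantic_search=embeddings", "apis=config_pinecone_api"]:
    print(item, "->", override_to_path(item))
```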

## CUI Batch Query Pipelines


### STEP-1A: RUN BATCH QUERY PIPELINE
###### IMPORT PACKAGES

```python
# from ddcuimap.umls import batch_query_pipeline as umls_bqp
# from ddcuimap.metamap import batch_query_pipeline as mm_bqp
# from ddcuimap.semantic_search import batch_hybrid_query_pipeline as ss_bqp
from ddcuimap.hydra_search import batch_hydra_query_pipeline as hs_bqp

from ddcuimap.utils import helper
from omegaconf import OmegaConf
```
###### LOAD/EDIT CONFIGURATION FILES
```python
cfg_hydra = helper.compose_config(overrides=["custom=hydra_base"])
# cfg_umls = helper.compose_config(overrides=["custom=de", "apis=config_umls_api"])
cfg_mm = helper.compose_config(overrides=["custom=de", "apis=config_metamap_api"])
cfg_ss = helper.compose_config(
    overrides=[
        "custom=title_def",
        "semantic_search=embeddings",
        "apis=config_pinecone_api",
    ]
)

# # UMLS API CREDENTIALS
# cfg_umls.apis.umls.user_info.apiKey = ''
# cfg_umls.apis.umls.user_info.email = ''

# # MetaMap API CREDENTIALS
# cfg_mm.apis.metamap.user_info.apiKey = ''
# cfg_mm.apis.metamap.user_info.email = ''
#
# # Pinecone API CREDENTIALS
# cfg_ss.apis.pinecone.index_info.apiKey = ''
# cfg_ss.apis.pinecone.index_info.environment = ''

print(OmegaConf.to_yaml(cfg_hydra))
```
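
Rather than pasting credentials into a script, one option is to read them from environment variables before assigning them to the config fields shown above. This is a sketch only: the variable names `UMLS_API_KEY` and `UMLS_EMAIL` and the `read_credentials` helper are assumptions made for illustration, not something the package defines.

```python
import os


def read_credentials(prefix, environ=None):
    """Return (api_key, email) from <PREFIX>_API_KEY and <PREFIX>_EMAIL (assumed names)."""
    environ = os.environ if environ is None else environ
    api_key = environ.get(f"{prefix}_API_KEY", "")
    email = environ.get(f"{prefix}_EMAIL", "")
    if not api_key:
        raise RuntimeError(f"{prefix}_API_KEY is not set")
    return api_key, email


# Demo with an explicit mapping so the example does not depend on the real environment.
demo_env = {"UMLS_API_KEY": "example-key", "UMLS_EMAIL": "user@example.org"}
api_key, email = read_credentials("UMLS", demo_env)
print(bool(api_key), email)

# The values could then be assigned as in the commented lines above, e.g.:
# cfg_umls.apis.umls.user_info.apiKey = api_key
# cfg_umls.apis.umls.user_info.email = email
```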

###### RUN BATCH QUERY PIPELINE
```python
# df_umls, cfg_umls = umls_bqp.run_umls_batch(cfg_umls)
# df_mm, cfg_mm = mm_bqp.run_mm_batch(cfg_mm)
# df_ss, cfg_ss = ss_bqp.run_hybrid_ss_batch(cfg_ss)
df_hydra, cfg_step1 = hs_bqp.run_hydra_batch(cfg_hydra, cfg_umls=None, cfg_mm=cfg_mm, cfg_ss=cfg_ss)

print(df_hydra.head())
```

### STEP-1B: MANUAL CURATION STEP IN EXCEL

###### CURATION/SELECTION
See the curation example in ***notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx***.

### STEP-2A: CREATE DATA DICTIONARY IMPORT FILE

###### IMPORT CURATION MODULES
```python
from ddcuimap.curation import create_dictionary_import_file
from ddcuimap.curation import check_cuis
from ddcuimap.utils import helper
```
###### CREATE DATA DICTIONARY IMPORT FILE

```python
cfg_step1 = helper.load_config(helper.choose_file("Load config file from Step 1"))
df_dd = create_dictionary_import_file.create_dd_file(cfg_step1)
print(df_dd.head())
```

### STEP-2B: CHECK CUIS IN DATA DICTIONARY IMPORT FILE

###### CHECK CUIS
```python
cfg_step2 = helper.load_config(helper.choose_file("Load config file from Step 2"))
df_check = check_cuis.check_cuis(cfg_step2)
print(df_check.head())
```
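
Independently of `check_cuis`, a quick format check can catch malformed identifiers before import. UMLS CUIs are written as the letter `C` followed by seven digits, and the output format in this README joins multiple CUIs with semicolons. The `invalid_cuis` function below is a hypothetical helper that checks only the format; it does not confirm that a CUI exists in the Metathesaurus and is not the package's own check.

```python
import re

# A CUI is the letter "C" followed by exactly seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def invalid_cuis(cell):
    """Return the entries of a semicolon-delimited cell that are not well-formed CUIs."""
    entries = [entry.strip() for entry in cell.split(";") if entry.strip()]
    return [entry for entry in entries if not CUI_PATTERN.fullmatch(entry)]


print(invalid_cuis("C1510829;C0001779"))
print(invalid_cuis("C1706256;c4553389;C04396"))
```

The first call reports nothing, while the second flags the lowercase and truncated entries.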

## Output: Data Dictionary + CUIs
Below is a sample modified data dictionary with curated CUIs, produced by:

1. Running Steps 1-2 on **title**, then taking the generated output dictionary file; and
2. Running Steps 1-2 again on **permissible value descriptions** to get the final output dictionary file.

| variable name | title                  | data element concept identifiers | data element concept names | data element terminology sources | permissible values   | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names           | permissible value terminology sources |
| ------------- | ---------------------- | -------------------------------- | -------------------------- | -------------------------------- | -------------------- | ------------------------------ | ------------------------------ | ------------------------------------- | ----------------------------------------- | ------------------------------------- |
| AgeYrs        | Age in years           | C1510829;C0001779                | Age-Years;Age              | UMLS;UMLS                        |                      |                                |                                |                                       |                                           |                                       |
| CaseContrlInd | Case control indicator | C0007328                         | Case-Control Studies       | UMLS                             | Case;Control;Unknown | Case;Control;Unknown           | 1;2;999                        | C1706256;C4553389;C0439673            | Clinical Study Case;Study Control;Unknown | UMLS;UMLS;UMLS                        |
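
The semicolon-delimited permissible value columns in this output appear to be positional: the first permissible value pairs with the first output code, the first concept identifier, and so on. The sketch below unpacks the `CaseContrlInd` row on that assumption; `unpack_row` is a hypothetical helper written for this example.

```python
# Values copied from the CaseContrlInd row of the table above.
ROW = {
    "permissible values": "Case;Control;Unknown",
    "permissible value output codes": "1;2;999",
    "permissible value concept identifiers": "C1706256;C4553389;C0439673",
    "permissible value concept names": "Clinical Study Case;Study Control;Unknown",
}


def unpack_row(row):
    """Split each column on ';' and pair the entries by position (hypothetical helper)."""
    columns = [row[key].split(";") for key in row]
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"columns have different numbers of entries: {sorted(lengths)}")
    return list(zip(*columns))


for value, code, cui, name in unpack_row(ROW):
    print(value, code, cui, name)
```

Raising an error on mismatched lengths makes curation slips, such as a missing CUI for one permissible value, visible early.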


## Semantic Search with SentenceTransformers Batch Queries
More documentation to come. The basic pipeline is outlined below:

### Subset/Embed/Upsert UMLS Metathesaurus for Pinecone vector database
#### Step 1: Subset local copy of UMLS Metathesaurus
#### Step 2: Embed UMLS CUI names and definitions and format metadata
#### Step 3: Upsert embeddings and metadata into Pinecone index

### Query UMLS Metathesaurus vector database with data dictionary embeddings
#### Step 1: Embed data dictionary fields
#### Step 2: Batch Query data dictionary against CUI names and definitions in Pinecone index
#### Step 3: Evaluate/Curate Results
#### Step 4: Create data dictionary based on curation
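
To make the query steps concrete, here is a toy version that uses hand-written three-dimensional vectors in place of SentenceTransformers embeddings and an in-memory dictionary in place of a Pinecone index. The vectors are made up for illustration and carry no real semantic meaning; the CUIs and names are taken from the output table in this README, and `top_matches` is a hypothetical helper.

```python
import math

# Stand-in "index": CUI -> (concept name, made-up embedding vector).
INDEX = {
    "C0001779": ("Age", [0.9, 0.1, 0.0]),
    "C0007328": ("Case-Control Studies", [0.1, 0.9, 0.2]),
    "C0439673": ("Unknown", [0.0, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, k=2):
    """Rank index entries by cosine similarity to the query and return the top k."""
    scored = [(cosine(query_vector, vec), cui, name) for cui, (name, vec) in INDEX.items()]
    return sorted(scored, reverse=True)[:k]


# Made-up embedding standing in for the field title "Age in years".
for score, cui, name in top_matches([0.8, 0.2, 0.1]):
    print(f"{cui}  {name}  {score:.3f}")
```

In the real pipeline the ranked candidates would then go to the manual evaluation and curation step rather than being accepted automatically.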


## Acknowledgements

The MetaMap API code included is from Will J Rogers' repository: https://github.com/lhncbc/skr_web_python_api

Special thanks to Olga Vovk, Henry Ogoe, and Sofia Syed for their guidance, feedback, and testing of this package.

## License

[MIT](https://choosealicense.com/licenses/mit/)

            

        