# data-dictionary-cui-mapping
This package assists with mapping a user's data dictionary fields to [UMLS](https://www.nlm.nih.gov/research/umls/index.html) concepts. It is designed to be modular and flexible to allow for different configurations and use cases.
Roughly, the high-level steps are as follows:
- Configure YAML files
- Load in data dictionary
- Preprocess desired columns
- Query for UMLS concepts using any or all of the following pipeline modules:
  - **umls** (*UMLS API*)
  - **metamap** (*MetaMap API*)
  - **semantic_search** (*relies on access to a custom Pinecone vector database*)
  - **hydra_search** (*combines any combination of the above three modules*)
- Manually curate/select concepts in Excel
- Create data dictionary file with new UMLS concept fields
## Prerequisites
- For the UMLS API and MetaMap API, you will need an account with the UMLS API and/or MetaMap API. You can sign up here: https://www.nlm.nih.gov/research/umls/index.html
- For Semantic Search with Pinecone, you will need a Pinecone account. You can sign up here: https://www.pinecone.io/. Please reach out to me if you would like temporary access to my Pinecone index to explore these embeddings.
## Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install [data-dictionary-cui-mapping](https://pypi.org/project/data-dictionary-cui-mapping/) from PyPI, or install it directly from the [GitHub repo](https://github.com/kevon217/data-dictionary-cui-mapping). The project uses [poetry](https://python-poetry.org/) for packaging and dependency management.
```bash
pip install data-dictionary-cui-mapping
# Alternatively, install directly from the GitHub repo:
# pip install git+https://github.com/kevon217/data-dictionary-cui-mapping.git
```
## Input: Data Dictionary
Below is a sample data dictionary format (*.csv*) that can be used as input for this package:
| variable name | title | permissible value descriptions |
| ------------- | ---------------------- |--------------------------------|
| AgeYrs | Age in years | |
| CaseContrlInd | Case control indicator | Case;Control;Unknown |
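As a rough illustration of this layout, the sketch below reads the sample rows with Python's standard `csv` module and splits the semicolon-delimited permissible value descriptions. It is not how the package itself loads dictionaries; the helper name `split_values` and the inline CSV text are made up for the example.

```python
import csv
import io

# Inline copy of the sample table above (illustrative only).
SAMPLE_CSV = """variable name,title,permissible value descriptions
AgeYrs,Age in years,
CaseContrlInd,Case control indicator,Case;Control;Unknown
"""


def split_values(cell):
    """Split a semicolon-delimited cell into a list, dropping empty items."""
    return [item.strip() for item in cell.split(";") if item.strip()]


# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    values = split_values(row["permissible value descriptions"])
    print(row["variable name"], "|", row["title"], "|", values)
```

A real dictionary file would be opened with `open(path, newline="")` in place of the `io.StringIO` wrapper.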
## Configuration Files
To run and customize these pipelines, you will need to create or edit the YAML configuration files located in `ddcuimap/configs`. Run configurations are saved and can be reloaded.
```bash
├───ddcuimap
│   ├───configs
│   │   │   config.yaml
│   │   │   __init__.py
│   │   │
│   │   ├───apis
│   │   │       __init__.py
│   │   │       config_metamap_api.yaml
│   │   │       config_pinecone_api.yaml
│   │   │       config_umls_api.yaml
│   │   │
│   │   ├───custom
│   │   │       de.yaml
│   │   │       hydra_base.yaml
│   │   │       pvd.yaml
│   │   │       title_def.yaml
│   │   │
│   │   ├───semantic_search
│   │   │       embeddings.yaml
```
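The override strings used later in this README (for example `custom=de` or `apis=config_umls_api`) look like Hydra-style config group selections, where `group=option` picks `configs/<group>/<option>.yaml`. The sketch below only illustrates that naming convention, on the assumption that it applies here; `override_to_path` is a hypothetical helper and not part of `ddcuimap`.

```python
from pathlib import PurePosixPath

# Root of the config tree shown above.
CONFIG_ROOT = PurePosixPath("ddcuimap/configs")


def override_to_path(override, root=CONFIG_ROOT):
    """Map a 'group=option' override to the YAML file it would select."""
    group, sep, option = override.partition("=")
    if not sep or not group or not option:
        raise ValueError(f"expected 'group=option', got {override!r}")
    return root / group / f"{option}.yaml"


for override in ["custom=title_def", "semantic_search=embeddings", "apis=config_pinecone_api"]:
    print(override, "->", override_to_path(override))
```

Under this reading, switching from `custom=de` to `custom=pvd` would swap in `custom/pvd.yaml` without touching the other groups.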
## CUI Batch Query Pipelines
### STEP-1A: RUN BATCH QUERY PIPELINE
###### IMPORT PACKAGES
```python
# from ddcuimap.umls import batch_query_pipeline as umls_bqp
# from ddcuimap.metamap import batch_query_pipeline as mm_bqp
# from ddcuimap.semantic_search import batch_hybrid_query_pipeline as ss_bqp
from ddcuimap.hydra_search import batch_hydra_query_pipeline as hs_bqp
from ddcuimap.utils import helper
from omegaconf import OmegaConf
```
###### LOAD/EDIT CONFIGURATION FILES
```python
cfg_hydra = helper.compose_config(overrides=["custom=hydra_base"])
# cfg_umls = helper.compose_config(overrides=["custom=de", "apis=config_umls_api"])
cfg_mm = helper.compose_config(overrides=["custom=de", "apis=config_metamap_api"])
cfg_ss = helper.compose_config(
    overrides=[
        "custom=title_def",
        "semantic_search=embeddings",
        "apis=config_pinecone_api",
    ]
)
# # UMLS API CREDENTIALS
# cfg_umls.apis.umls.user_info.apiKey = ''
# cfg_umls.apis.umls.user_info.email = ''
# # MetaMap API CREDENTIALS
# cfg_mm.apis.metamap.user_info.apiKey = ''
# cfg_mm.apis.metamap.user_info.email = ''
#
# # Pinecone API CREDENTIALS
# cfg_ss.apis.pinecone.index_info.apiKey = ''
# cfg_ss.apis.pinecone.index_info.environment = ''
print(OmegaConf.to_yaml(cfg_hydra))
```
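Rather than typing keys into a script or notebook, credentials can be read from environment variables and assigned to the same config fields. This is a minimal sketch, assuming hypothetical variable names (`UMLS_API_KEY`, `UMLS_EMAIL`) and using a plain namespace as a stand-in for the real config object:

```python
import os
from types import SimpleNamespace


def read_credentials(names, environ=None):
    """Return {field: value} for each field, failing loudly if a variable is unset."""
    environ = os.environ if environ is None else environ
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {field: environ[var] for field, var in names.items()}


# Stand-in for cfg_umls.apis.umls.user_info; the real object comes from compose_config.
user_info = SimpleNamespace(apiKey="", email="")

# Hypothetical variable names; a fake environment keeps the example runnable.
fake_env = {"UMLS_API_KEY": "example-key", "UMLS_EMAIL": "user@example.org"}
creds = read_credentials({"apiKey": "UMLS_API_KEY", "email": "UMLS_EMAIL"}, environ=fake_env)
for field, value in creds.items():
    setattr(user_info, field, value)

print(user_info.email)
```

In real use, omit the `environ` argument so the values come from `os.environ`.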
###### RUN BATCH QUERY PIPELINE
```python
# df_umls, cfg_umls = umls_bqp.run_umls_batch(cfg_umls)
# df_mm, cfg_mm = mm_bqp.run_mm_batch(cfg_mm)
# df_ss, cfg_ss = ss_bqp.run_hybrid_ss_batch(cfg_ss)
df_hydra, cfg_step1 = hs_bqp.run_hydra_batch(cfg_hydra, cfg_umls=None, cfg_mm=cfg_mm, cfg_ss=cfg_ss)
print(df_hydra.head())
```
### STEP-1B: MANUAL CURATION STEP IN EXCEL
###### CURATION/SELECTION
See the curation example in ***notebooks/examples_files/DE_Step-1_curation_keepCol.xlsx***.
### STEP-2A: CREATE DATA DICTIONARY IMPORT FILE
###### IMPORT CURATION MODULES
```python
from ddcuimap.curation import create_dictionary_import_file
from ddcuimap.curation import check_cuis
from ddcuimap.utils import helper
```
###### CREATE DATA DICTIONARY IMPORT FILE
```python
cfg_step1 = helper.load_config(helper.choose_file("Load config file from Step 1"))
df_dd = create_dictionary_import_file.create_dd_file(cfg_step1)
print(df_dd.head())
```
### STEP-2B: CHECK CUIS IN DATA DICTIONARY IMPORT FILE
###### CHECK CUIS
```python
cfg_step2 = helper.load_config(helper.choose_file("Load config file from Step 2"))
df_check = check_cuis.check_cuis(cfg_step2)
print(df_check.head())
```
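Independently of `check_cuis`, whose internals are not described here, a quick format check can catch malformed identifiers before import. UMLS CUIs are the letter `C` followed by seven digits; the sketch below flags anything else in a semicolon-delimited cell. The function name `find_malformed_cuis` is hypothetical.

```python
import re

# A CUI is "C" followed by exactly seven digits, e.g. C0001779.
CUI_PATTERN = re.compile(r"C\d{7}")


def find_malformed_cuis(cell):
    """Return the items in a semicolon-delimited cell that are not well-formed CUIs."""
    items = [item.strip() for item in cell.split(";") if item.strip()]
    return [item for item in items if not CUI_PATTERN.fullmatch(item)]


print(find_malformed_cuis("C1510829;C0001779"))
print(find_malformed_cuis("C0007328;c123;C12345678"))
```

This only checks the shape of each identifier; it cannot tell whether a well-formed CUI actually exists in the Metathesaurus.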
## Output: Data Dictionary + CUIs
Below is a sample data dictionary with curated CUIs added. It was produced by:
1. Running Steps 1-2 on **title** and taking the generated output dictionary file, then
2. Running Steps 1-2 again on **permissible value descriptions** to get the final output dictionary file.
| variable name | title | data element concept identifiers | data element concept names | data element terminology sources | permissible values | permissible value descriptions | permissible value output codes | permissible value concept identifiers | permissible value concept names | permissible value terminology sources |
| ------------- | ---------------------- | -------------------------------- | -------------------------- | -------------------------------- | -------------------- | ------------------------------ | ------------------------------ | ------------------------------------- | ----------------------------------------- | ------------------------------------- |
| AgeYrs | Age in years | C1510829;C0001779 | Age-Years;Age | UMLS;UMLS | | | | | | |
| CaseContrlInd | Case control indicator | C0007328 | Case-Control Studies | UMLS | Case;Control;Unknown | Case;Control;Unknown | 1;2;999 | C1706256;C4553389;C0439673 | Clinical Study Case;Study Control;Unknown | UMLS;UMLS;UMLS |
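The semicolon-delimited columns in a row appear to be positional: the n-th permissible value lines up with the n-th output code, CUI, and concept name. The sketch below unpacks the `CaseContrlInd` row above and checks that the lists have matching lengths. The function name `align_row` is illustrative and not part of the package.

```python
def align_row(row, columns):
    """Split the given columns on ';' and zip them into one tuple per position."""
    split = [[item.strip() for item in row[col].split(";")] for col in columns]
    lengths = {len(items) for items in split}
    if len(lengths) != 1:
        raise ValueError(f"columns have mismatched item counts: {sorted(lengths)}")
    return list(zip(*split))


# Values copied from the CaseContrlInd row of the sample output table.
row = {
    "permissible values": "Case;Control;Unknown",
    "permissible value output codes": "1;2;999",
    "permissible value concept identifiers": "C1706256;C4553389;C0439673",
    "permissible value concept names": "Clinical Study Case;Study Control;Unknown",
}
columns = list(row)

for value, code, cui, name in align_row(row, columns):
    print(f"{value} ({code}) -> {cui} {name}")
```

A length mismatch raises `ValueError`, which is a cheap way to spot a cell where a delimiter was dropped during curation.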
## Semantic Search with SentenceTransformers Batch Queries
More documentation to come. The basic pipeline is outlined below:
### Subset/Embed/Upsert UMLS Metathesaurus for Pinecone vector database
#### Step 1: Subset local copy of UMLS Metathesaurus
#### Step 2: Embed UMLS CUI names and definitions and format metadata
#### Step 3: Upsert embeddings and metadata into Pinecone index
### Query UMLS Metathesaurus vector database with data dictionary embeddings
#### Step 1: Embed data dictionary fields
#### Step 2: Batch Query data dictionary against CUI names and definitions in Pinecone index
#### Step 3: Evaluate/Curate Results
#### Step 4: Create data dictionary based on curation
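The query side of this outline boils down to nearest-neighbour search over embeddings. The sketch below shows the idea with cosine similarity over tiny hand-made vectors. It does not call SentenceTransformers or Pinecone, and the vectors are invented purely for illustration; only the CUIs and concept names are taken from the sample output table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the k highest-scoring (score, cui, name) tuples for a query vector."""
    scored = [(cosine_similarity(query, vec), cui, name) for cui, (name, vec) in index.items()]
    return sorted(scored, reverse=True)[:k]


# Toy "vector database": CUI -> (concept name, made-up 3-dimensional embedding).
toy_index = {
    "C0001779": ("Age", [0.9, 0.1, 0.0]),
    "C0007328": ("Case-Control Studies", [0.0, 0.2, 0.9]),
    "C0439673": ("Unknown", [0.1, 0.9, 0.1]),
}

query_vector = [0.8, 0.2, 0.1]  # made-up stand-in for an embedded "Age in years" title
results = top_k(query_vector, toy_index)
for score, cui, name in results:
    print(f"{cui} {name}: {score:.3f}")
```

In the real pipeline the vectors would come from a sentence-embedding model and the ranking would be done by the vector database, but the curation step still reviews a ranked candidate list of this shape.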
## Acknowledgements
The MetaMap API code included is from Will J Roger's repository: https://github.com/lhncbc/skr_web_python_api
Special thanks to Olga Vovk, Henry Ogoe, and Sofia Syed for their guidance, feedback, and testing of this package.
## License
[MIT](https://choosealicense.com/licenses/mit/)
Raw data
{
"_id": null,
"home_page": "https://github.com/kevon217/data-dictionary-cui-mapping",
"name": "data-dictionary-cui-mapping",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.1,<4.0.0",
"maintainer_email": "",
"keywords": "BRICS,curation,data dictionary,UMLS,MetaMap,Metathesaurus,CUI,concept unique identifier,NLM,PubMedBERT,pritamdeka,pritamdeka/PubMedBERT-mnli-snli-scinli-scitail-mednli-stsb,semantic search,Pinecone,embeddings,vector database",
"author": "Kevin Armengol",
"author_email": "kevin.armengol@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/cd/f0/676fb8c7d91ffd616c4362925f5127ae970ea23de792d57378f514627b68/data_dictionary_cui_mapping-1.1.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "This package allows you to load in a data dictionary and map cuis to defined fields using either the UMLS API or MetaMap API from NLM, or a Semantic Search pipeline using Pinecone vector database.",
"version": "1.1.6",
"project_urls": {
"Homepage": "https://github.com/kevon217/data-dictionary-cui-mapping",
"Repository": "https://github.com/kevon217/data-dictionary-cui-mapping"
},
"split_keywords": [
"brics",
"curation",
"data dictionary",
"umls",
"metamap",
"metathesaurus",
"cui",
"concept unique identifier",
"nlm",
"pubmedbert",
"pritamdeka",
"pritamdeka/pubmedbert-mnli-snli-scinli-scitail-mednli-stsb",
"semantic search",
"pinecone",
"embeddings",
"vector database"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "06ddc99f3c9813c8bf7fef520cef9ae4bf091d341d4205e7886c9902eac8b582",
"md5": "0e54f2073a0f2c139ef3dca9abf383cf",
"sha256": "b5ee87685bde59ba9f7e08cead3bc8fe93063300c1833fb0d54fd57d9b6856d4"
},
"downloads": -1,
"filename": "data_dictionary_cui_mapping-1.1.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0e54f2073a0f2c139ef3dca9abf383cf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.1,<4.0.0",
"size": 20424719,
"upload_time": "2023-05-31T13:23:11",
"upload_time_iso_8601": "2023-05-31T13:23:11.200504Z",
"url": "https://files.pythonhosted.org/packages/06/dd/c99f3c9813c8bf7fef520cef9ae4bf091d341d4205e7886c9902eac8b582/data_dictionary_cui_mapping-1.1.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cdf0676fb8c7d91ffd616c4362925f5127ae970ea23de792d57378f514627b68",
"md5": "0959e9788c5073139695e6d79fe05afc",
"sha256": "22fdf3e48f05c44ae34c97a2c24dca296022dc65438e6bd805b162e1c711c84a"
},
"downloads": -1,
"filename": "data_dictionary_cui_mapping-1.1.6.tar.gz",
"has_sig": false,
"md5_digest": "0959e9788c5073139695e6d79fe05afc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.1,<4.0.0",
"size": 20322406,
"upload_time": "2023-05-31T13:23:16",
"upload_time_iso_8601": "2023-05-31T13:23:16.198429Z",
"url": "https://files.pythonhosted.org/packages/cd/f0/676fb8c7d91ffd616c4362925f5127ae970ea23de792d57378f514627b68/data_dictionary_cui_mapping-1.1.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-31 13:23:16",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kevon217",
"github_project": "data-dictionary-cui-mapping",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "data-dictionary-cui-mapping"
}