bioel

Name: bioel
Version: 0.1.13
Home page: https://github.com/davidkartchner/biomedical-entity-linking/tree/main/bioel
Summary: An easy-to-use package for all your biomedical entity linking needs.
Upload time: 2025-02-08 16:14:30
Author: Pathology Dynamics Lab, Georgia Institute of Technology
Requires Python: >=3.9
License: MIT
Keywords: bioel, biomedical, entity, linking, entity-linking, biomedical-entity-linking
            # BioEL: A comprehensive package for training, evaluating, and benchmarking biomedical entity linking models.

## Installation



```bash
conda create -n bioel python=3.9
conda activate bioel
pip install bioel

python3 -m pip install pip==24.0

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install numpy==1.23
```

BioBART and BioGenEL require transformers version 4.44.2, which in turn depends on functionality available in numpy version 1.23.

Currently, installing `fairseq` requires cloning the repository directly, as we encountered an issue when attempting installation via pip. We have reported the issue to the `fairseq` team, and once it's resolved, we will update our setup accordingly.

## Development Instructions

1. Install `bioel` as an editable package using `pip` (e.g. `pip install -e .` from the repository root).
1. Add any new dependencies to `setup.py`.
1. Add tests to `tests/` directory.

## Ontologies
Ontologies included in the package:

- UMLS: licensed by the National Library of Medicine and requires a free account to download. You can sign up for an account at https://uts.nlm.nih.gov/uts/signup-login. Once your account has been approved, you can download the UMLS Metathesaurus at https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html.

- MeSH: derived from UMLS.

- NCBI Gene (Entrez) ontology: can be downloaded at https://ftp.ncbi.nih.gov/gene/DATA/ (gene_info.gz).

- MEDIC (MErged DIsease voCabulary): can be downloaded at https://ctdbase.org/downloads/
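The NCBI Gene file above is a plain tab-separated table, so extracting identifiers, symbols, and synonyms from it is straightforward. A minimal sketch (column positions follow the published `gene_info` header; the in-memory sample is illustrative):

```python
import csv
import io

# Tiny in-memory sample in the NCBI gene_info format (tab-separated).
# Real files come from https://ftp.ncbi.nih.gov/gene/DATA/ (gene_info.gz).
sample = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t7157\tTP53\t-\tBCC7|LFS1|P53\n"
    "9606\t672\tBRCA1\t-\tBRCAI|BRCC1\n"
)

def parse_gene_info(handle):
    """Map each GeneID to its symbol and synonym list."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the '#tax_id ...' header row
    genes = {}
    for row in reader:
        gene_id, symbol, synonyms = row[1], row[2], row[4]
        aliases = [] if synonyms == "-" else synonyms.split("|")
        genes[gene_id] = {"symbol": symbol, "synonyms": aliases}
    return genes

genes = parse_gene_info(io.StringIO(sample))
print(genes["7157"]["symbol"])    # TP53
print(genes["7157"]["synonyms"])  # ['BCC7', 'LFS1', 'P53']
```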

## Resolving abbreviations
As a preprocessing step, we resolve abbreviations in the text using Ab3P, an abbreviation detector created for biomedical text. We ran abbreviation detection on the text of all documents in our benchmark, the results of which are stored in a large dictionary in `data/abbreviations.json`. In order to reproduce our abbreviation detection/resolution pipeline, please run the following:

```python
from bioel.utils.solve_abbreviation.solve_abbreviations import create_abbrev

create_abbrev(output_dir, all_dataset)
# output_dir : path where abbreviations.json will be created
# all_dataset : datasets for which you want the abbreviations
```
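The exact layout of `abbreviations.json` is whatever `create_abbrev` emits; assuming it maps document IDs to short-form/long-form pairs (an assumption for illustration only), resolution reduces to a lookup-and-replace:

```python
import json
import io

# Hypothetical abbreviations.json content: document id -> {short form: long form}.
# The real file is produced by create_abbrev; this shape is an assumption.
raw = '{"doc1": {"AD": "Alzheimer disease", "CNS": "central nervous system"}}'
abbreviations = json.load(io.StringIO(raw))

def resolve(doc_id, text, table):
    """Replace each detected short form with its long form."""
    for short, long_form in table.get(doc_id, {}).items():
        text = text.replace(short, long_form)
    return text

resolved = resolve("doc1", "AD affects the CNS.", abbreviations)
print(resolved)  # Alzheimer disease affects the central nervous system.
```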

## Example Usage: Model Training, Inference, and Evaluation
```python
# Import modules
from bioel.model import BioEL_Model
from bioel.evaluate import Evaluate

# 1) load model
# Look at data/params.json for more information about the config file
krissbert = BioEL_Model.load_krissbert(
        name="krissbert",
        params_file="path/to/params_krissbert.json", # config file
        # checkpoint_path="path/to/checkpoint" # if you have an already trained model
    )

# 2) Training and inference
krissbert.training() # Train
krissbert.inference() # Inference

abbreviations_path = "data/abbreviations.json"
dataset_names = ["ncbi_disease"]
model_names = ["krissbert"]
path_to_result = {
    "ncbi_disease": {
        "krissbert": "results/ncbi_disease.json"
    }
}
eval_strategies=["basic"]

# 3) Results
evaluator = Evaluate(dataset_names=dataset_names, 
                     model_names=model_names, 
                     path_to_result=path_to_result, 
                     abbreviations_path=abbreviations_path, 
                     eval_strategies=eval_strategies,
                     max_k=10, # number of candidates to consider
                     )
evaluator.load_results()
evaluator.process_datasets()
evaluator.evaluate()
evaluator.plot_results()
evaluator.detailed_results()
```
1. Load the model: use the `BioEL_Model` class to load your model. Ensure you have a configuration file (`params.json`) ready.
2. Training and inference: perform training and inference using the `training()` and `inference()` methods.
3. Evaluation:
   - Run evaluations for all models across all datasets using the `Evaluate` class.
   - For error analysis with hit index details, use the `evaluator.error_analysis_dfs` attribute.
   - For detailed performance metrics such as failure stage breakdown, accuracy per type, recall@k per type, MAP@k, and statistical significance (p-values), refer to the `evaluator.detailed_results_analysis` attribute.
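When comparing several models on several datasets, the nested `path_to_result` mapping that `Evaluate` expects can be built programmatically (the file naming scheme below is illustrative, not prescribed by the package):

```python
# Build the nested {dataset: {model: result_path}} mapping for Evaluate.
# The results/<model>/<dataset>.json naming scheme is illustrative only.
dataset_names = ["ncbi_disease", "bc5cdr"]
model_names = ["krissbert", "sapbert"]

path_to_result = {
    dataset: {model: f"results/{model}/{dataset}.json" for model in model_names}
    for dataset in dataset_names
}

print(path_to_result["bc5cdr"]["sapbert"])  # results/sapbert/bc5cdr.json
```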

## Config files
Examples of configuration files for the various models are provided in the `data/` directory. They can serve as references for creating or modifying your own configuration files.

## Load the different datasets
```python
from bioel.dataset import Dataset

abbreviations_path = "data/abbreviations.json"
dataset_name = "bc5cdr"  # specify the desired dataset name from the BigBio collection
dataset = Dataset(dataset_name=dataset_name, abbreviations_path=abbreviations_path)
```


## Load the different ontologies

```python
from bioel.ontology import BiomedicalOntology

##### ----------------- Load MEDIC ----------------- #####

dataset_name = "ncbi_disease"
medic_dict = {"name": "medic", "filepath": "path/to/medic"}  # medic.tsv file
ontology = BiomedicalOntology.load_medic(**medic_dict)

##### ----------------- Load Entrez ----------------- #####

dataset_name = "gnormplus"  # or "nlm_gene"
entrez_dict = {
    "name": "entrez",
    "filepath": "path/to/entrez",  # gene_info.tsv file
    "dataset": dataset_name,
}
ontology = BiomedicalOntology.load_entrez(**entrez_dict)

##### ----------------- Load MeSH ----------------- #####

dataset_name = "nlmchem"
mesh_dict = {"name": "mesh", "filepath": "path/to/umls"}
ontology = BiomedicalOntology.load_mesh(**mesh_dict)

##### ----------------- Load UMLS (st21pv subset) ----------------- #####

dataset_name = "medmentions_st21pv"
umls_dict_st21pv = {
    "name": "umls",
    "filepath": "path/to/umls",
    "path_st21pv_cui": "data/umls_cuis_st21pv.json",
}
ontology = BiomedicalOntology.load_umls(**umls_dict_st21pv)

##### ----------------- Load UMLS (full) ----------------- #####

dataset_name = "medmentions_full"
umls_dict = {"name": "umls", "filepath": "path/to/umls"}
ontology = BiomedicalOntology.load_umls(**umls_dict)
```


## ArboEL


ArboEL operates in two stages: first, train the biencoder (`load_arboel_biencoder`); then use the biencoder's candidate results to train the crossencoder (`load_arboel_crossencoder`) and perform evaluation with the crossencoder.
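The retrieve-then-rerank flow can be mocked end-to-end to show how candidates move between the two stages. The scoring functions below are toy stand-ins, not the ArboEL models:

```python
# Toy stand-ins for the two ArboEL stages: a cheap "biencoder" retrieves a
# candidate shortlist, and a mock "crossencoder" reranks it. Scores are toys.
def biencoder_candidates(mention, entities, k=2):
    """Stage 1: retrieve top-k candidates by a cheap token-overlap score."""
    def overlap(entity):
        return len(set(mention.lower().split()) & set(entity.lower().split()))
    return sorted(entities, key=overlap, reverse=True)[:k]

def crossencoder_rerank(mention, candidates):
    """Stage 2: rerank the shortlist with a finer (still mock) score."""
    def score(entity):
        return sum(ch in entity.lower() for ch in mention.lower())
    return max(candidates, key=score)

entities = ["lung cancer", "breast cancer", "lung fibrosis"]
candidates = biencoder_candidates("lung cancer", entities)
best = crossencoder_rerank("lung cancer", candidates)
print(best)  # lung cancer
```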

## SapBERT

To train SapBERT, the first step is to generate the aliases, which can be done using the following method:

```python
# Example with the MEDIC ontology
from bioel.models.sapbert.data.data_processing import cuis_to_aliases
from bioel.ontology import BiomedicalOntology

ontology = BiomedicalOntology.load_medic(filepath="path/to/medic", name="medic")

cuis_to_aliases(
    ontology=ontology,  # ontology object
    save_dir="path/to/alias.txt",  # path where to save the aliases
    dataset_name="ncbi_disease",  # name of the dataset
)
```
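Assuming the alias file pairs each concept identifier with one of its names per line (the `cui||alias` layout is an assumption for illustration, not the package's documented format), writing it from an ontology-like dict looks like:

```python
import io

# Hypothetical ontology content: CUI -> list of aliases. The `cui||alias`
# line format below is an assumption for illustration only.
ontology = {
    "D003924": ["Diabetes Mellitus, Type 2", "Type 2 Diabetes"],
    "D001943": ["Breast Neoplasms"],
}

def write_aliases(entries, handle):
    """Write one `cui||alias` line per (CUI, alias) pair."""
    for cui, aliases in entries.items():
        for alias in aliases:
            handle.write(f"{cui}||{alias}\n")

buf = io.StringIO()
write_aliases(ontology, buf)
print(buf.getvalue().splitlines()[0])  # D003924||Diabetes Mellitus, Type 2
```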

## BioBART/BioGenEL

BioBART and BioGenEL share the same entity linking module:
- To fine-tune from BioBART, set the `model_load_path` parameter in the .json config file to `GanjinZero/biobart-v2-large`; this will load the pretrained weights from Hugging Face.

- To fine-tune from BioGenEL's knowledge-base-guided pretrained weights, first download them from this link: https://drive.google.com/file/d/1TqvQRau1WPYE9hKfemKZr-9ptE-7USAH/view?usp=sharing and then set the `model_load_path` parameter in the .json config file to the path where you stored them.

A single config file handles both training and evaluation for BioBART/BioGenEL. Training and evaluation cannot be launched at the same time, so first set all the parameters in the config file for both; when you run training, set `model_load_path` as described above.
Some important information:
- If `model_load_path` is not provided, the dataset folder defaults to `biogenel/data/`.
- Set `preprocess_data` to `true` the first time you train on a given dataset or ontology to generate the preprocessed data. To train with different parameters while reusing the same preprocessed data, set it to `false`. During evaluation, set it to `false`.
- `trie_path` and `dict_path` resolve to the `data/abbreviationmode/datasetname/` folder for a given `data` folder. Thus, if `resolve_abbrevs` is set to `true`, make sure to include `/abbr_res/datasetname/trie.pkl` as the path extension for the trie and `/abbr_res/datasetname/target_kb.json` for the dict.
- For evaluation, set `evaluation` to `true` and change `model_load_path` to the path you provided as `model_save_path` during training.
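Putting the bullet points above together, a hypothetical config fragment for an evaluation run with abbreviation resolution might look like the following (only keys mentioned above are shown; all values are illustrative):

```json
{
  "model_load_path": "path/to/model_saved_during_training",
  "evaluation": true,
  "preprocess_data": false,
  "resolve_abbrevs": true,
  "trie_path": "data/abbr_res/ncbi_disease/trie.pkl",
  "dict_path": "data/abbr_res/ncbi_disease/target_kb.json"
}
```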


            
