drug-named-entity-recognition


Name: drug-named-entity-recognition
Version: 2.0.4
home_page: https://fastdatascience.com/drug-named-entity-recognition-python-library
Summary: Drug Named Entity Recognition library to find and resolve drug names in a string (drug named entity linking)
upload_time: 2024-10-14 13:34:19
maintainer: None
docs_url: None
author: Thomas Wood
requires_python: >=3.6
license: MIT License Copyright (c) 2023 Fast Data Science Ltd (https://fastdatascience.com). Maintainer: Thomas Wood. Tutorial at https://fastdatascience.com/drug-named-entity-recognition-python-library/ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: drug, bio, biomedical, medical, pharma, pharmaceutical, ner, nlp, named entity recognition, natural language processing, named entity linking
![Fast Data Science logo](https://raw.githubusercontent.com/fastdatascience/brand/main/primary_logo.svg)

<a href="https://fastdatascience.com"><span align="left">🌐 fastdatascience.com</span></a>
<a href="https://www.linkedin.com/company/fastdatascience/"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/linkedin.svg" alt="Fast Data Science | LinkedIn" width="21px"/></a>
<a href="https://twitter.com/fastdatascienc1"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/x.svg" alt="Fast Data Science | X" width="21px"/></a>
<a href="https://www.instagram.com/fastdatascience/"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/instagram.svg" alt="Fast Data Science | Instagram" width="21px"/></a>
<a href="https://www.facebook.com/fastdatascienceltd"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/fb.svg" alt="Fast Data Science | Facebook" width="21px"/></a>
<a href="https://www.youtube.com/channel/UCLPrDH7SoRT55F6i50xMg5g"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/yt.svg" alt="Fast Data Science | YouTube" width="21px"/></a>
<a href="https://g.page/fast-data-science"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/google.svg" alt="Fast Data Science | Google" width="21px"/></a>
<a href="https://medium.com/fast-data-science"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/medium.svg" alt="Fast Data Science | Medium" width="21px"/></a>
<a href="https://mastodon.social/@fastdatascience"><img align="left" src="https://raw.githubusercontent.com//harmonydata/.github/main/profile/mastodon.svg" alt="Fast Data Science | Mastodon" width="21px"/></a>

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10970631.svg)](https://doi.org/10.5281/zenodo.10970631)


You can run the walkthrough Python notebook in [Google Colab](https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb) with a single click: <a href="https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drug Named Entity Recognition Python library by Fast Data Science

## Finds drug names and can look up molecular structure

<!-- badges: start -->
![my badge](https://badgen.net/badge/Status/In%20Development/orange)
[![PyPI package](https://img.shields.io/badge/pip%20install-drug_named_entity_recognition-brightgreen)](https://pypi.org/project/drug-named-entity-recognition/) [![version number](https://img.shields.io/pypi/v/drug-named-entity-recognition?color=green&label=version)](https://github.com/fastdatascience/drug_named_entity_recognition/releases) [![License](https://img.shields.io/github/license/fastdatascience/drug_named_entity_recognition)](https://github.com/fastdatascience/drug_named_entity_recognition/blob/main/LICENSE)
[![pypi Version](https://img.shields.io/pypi/v/drug_named_entity_recognition.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/drug_named_entity_recognition/)
 [![version number](https://img.shields.io/pypi/v/drug_named_entity_recognition?color=green&label=version)](https://github.com/fastdatascience/drug_named_entity_recognition/releases) [![PyPi downloads](https://static.pepy.tech/personalized-badge/drug_named_entity_recognition?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/drug_named_entity_recognition/)
[![forks](https://img.shields.io/github/forks/fastdatascience/drug_named_entity_recognition)](https://github.com/fastdatascience/drug_named_entity_recognition/forks)

<!-- badges: end -->

# 💊 Drug Named Entity Recognition

Developed by Fast Data Science, https://fastdatascience.com

Source code at https://github.com/fastdatascience/drug_named_entity_recognition

Tutorial at https://fastdatascience.com/drug-named-entity-recognition-python-library/

This is a lightweight Python [natural language processing](https://naturallanguageprocessing.com/applications-nlp-use-cases-business-solutions/) library for finding drug names in a string, otherwise known as [named entity recognition (NER)](https://fastdatascience.com/named-entity-recognition/) and named entity linking. This can be used for analysing unstructured text documents such as [clinical trial protocols](https://clinicaltrialrisk.org/clinical-trial-protocol-software/) or other [unstructured data in biopharma](https://harmonydata.ac.uk/data-harmonisation/data-harmonisation-in-biopharma/). 

Please note that this library only reports drugs that it can identify with high confidence.

**New in version 2.0.0: we can support misspellings by using fuzzy matching!**

It also only finds the English names of these drugs. Names [in other languages](https://fastdatascience.com/natural-language-processing/multilingual-natural-language-processing/) are not supported.

It also doesn't find short code names of drugs, for example abbreviations commonly used in medicine such as "Ceph" for "Cephradin", as these are highly ambiguous.

# 💻Installing Drug Named Entity Recognition Python package

You can install Drug Named Entity Recognition from [PyPI](https://pypi.org/project/drug-named-entity-recognition).

```
pip install drug-named-entity-recognition
```

If you get an error installing Drug Named Entity Recognition, try making a new Python environment in Conda (`conda create -n test-env; conda activate test-env`) or Venv (`python -m venv testenv; source testenv/bin/activate` / `testenv\Scripts\activate`) and then installing the library.

The library already contains the drug names so if you don't need to update the dictionary, then you should not have to run any of the download scripts.

If you have problems installing, try our [Google Colab](https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb) walkthrough.

# 💡Usage examples

You must first tokenise your input text using a tokeniser of your choice (NLTK, spaCy, etc).

You pass a list of strings to the `find_drugs` function.

Example 1

```
from drug_named_entity_recognition import find_drugs

find_drugs("i bought some Prednisone".split(" "))
```

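This outputs a list of tuples, one per drug found. Each tuple holds a dictionary of information about the drug, followed by the start and end token positions of the match (here `3, 3`, because "Prednisone" is the token at index 3).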

```
[({'name': 'Prednisone', 'synonyms': {'Sone', 'Sterapred', 'Deltasone', 'Panafcort', 'Prednidib', 'Cortan', 'Rectodelt', 'Prednisone', 'Cutason', 'Meticorten', 'Panasol', 'Enkortolon', 'Ultracorten', 'Decortin', 'Orasone', 'Winpred', 'Dehydrocortisone', 'Dacortin', 'Cortancyl', 'Encorton', 'Encortone', 'Decortisyl', 'Kortancyl', 'Pronisone', 'Prednisona', 'Predniment', 'Prednisonum', 'Rayos'}, 'medline_plus_id': 'a601102', 'mesh_id': 'D018931', 'drugbank_id': 'DB00635'}, 3, 3)]
```
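The sketch below shows one way to flatten such a result into something easier to log or tabulate. It does not call the library: `matches` is a hand-copied, shortened version of the output above, and `summarise_matches` is a hypothetical helper name. It assumes the two integers are inclusive token indices.

```python
# Sample copied from the documented output above (synonym set shortened).
tokens = "i bought some Prednisone".split(" ")
matches = [
    (
        {
            "name": "Prednisone",
            "synonyms": {"Deltasone", "Prednisone", "Rayos"},
            "medline_plus_id": "a601102",
            "mesh_id": "D018931",
            "drugbank_id": "DB00635",
        },
        3,
        3,
    )
]


def summarise_matches(matches, tokens):
    """Turn (info, start, end) tuples into flat dictionaries.

    Assumption: start and end are inclusive token indices.
    """
    rows = []
    for info, start, end in matches:
        rows.append(
            {
                "matched_text": " ".join(tokens[start : end + 1]),
                "name": info["name"],
                # Not every drug carries every identifier, so use .get().
                "drugbank_id": info.get("drugbank_id"),
                "mesh_id": info.get("mesh_id"),
                "medline_plus_id": info.get("medline_plus_id"),
            }
        )
    return rows


rows = summarise_matches(matches, tokens)
print(rows)
```

Using `.get()` rather than indexing avoids a `KeyError` for drugs whose entry lacks one of the identifiers.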

## Fuzzy matching (misspellings)

You can turn on fuzzy matching, which uses n-grams to catch misspellings, by passing `is_fuzzy_match=True`:

```
find_drugs(["paraxcetamol"], is_fuzzy_match=True)
```
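To give an intuition for why n-grams help with misspellings, here is a minimal, self-contained sketch of character n-gram similarity. It is an illustration of the general idea only and is not the library's actual matching code; the function names and the three-name candidate list are made up for the example.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a lower-cased, padded word."""
    pad = "#" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical mini-dictionary; the real library ships a much larger one.
candidates = ["paracetamol", "prednisone", "sertraline"]
query = "paraxcetamol"

# Pick the dictionary entry whose n-grams overlap most with the query.
best = max(candidates, key=lambda name: ngram_similarity(query, name))
print(best, round(ngram_similarity(query, best), 2))
```

A single inserted letter only disturbs the few n-grams that contain it, so the misspelt word still shares most of its n-grams with the correct name.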

## Metadata such as molecular structure

You can retrieve molecular structure in `.mol` format with:

```
from drug_named_entity_recognition.drugs_finder import find_drugs
drugs = find_drugs("i bought some paracetamol".split(" "), is_include_structure=True)
```

This will return the atomic structure of the drug if that data is available.

```
>>> print (drugs[0][0]["structure_mol"])
316
  Mrv0541 02231214352D          

 11 11  0  0  0  0            999 V2000
    2.3645   -2.1409    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7934    1.1591    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    1.1591    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    0.3341    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645   -1.3159    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    1.5716    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    2.3966    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  9  1  0  0  0  0
  2 10  2  0  0  0  0
  3  4  1  0  0  0  0
  3 10  1  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  6  8  2  0  0  0  0
  7  9  2  0  0  0  0
  8  9  1  0  0  0  0
 10 11  1  0  0  0  0
M  END
DB00316

```
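If you only need basic facts from the structure, such as which elements are present, the string can be read with plain Python. The sketch below assumes the string is a V2000 molfile like the one printed above (a counts line tagged `V2000`, followed by an atom block and a bond block); `parse_v2000` is a hypothetical helper, not part of the library. For anything beyond simple counting, a dedicated chemistry toolkit is the better choice.

```python
from collections import Counter

# The structure string printed above, reproduced as-is.
STRUCTURE_MOL = """316
  Mrv0541 02231214352D

 11 11  0  0  0  0            999 V2000
    2.3645   -2.1409    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7934    1.1591    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    1.1591    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645    0.3341    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.6500   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3645   -1.3159    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    1.5716    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0790    2.3966    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  9  1  0  0  0  0
  2 10  2  0  0  0  0
  3  4  1  0  0  0  0
  3 10  1  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  6  8  2  0  0  0  0
  7  9  2  0  0  0  0
  8  9  1  0  0  0  0
 10 11  1  0  0  0  0
M  END
DB00316
"""


def parse_v2000(mol_text):
    """Return (element counts, bonds) from a V2000 molfile string."""
    lines = mol_text.splitlines()
    # The counts line is the one tagged V2000; the atom and bond counts
    # are fixed-width, 3-character fields at its start.
    counts_idx = next(
        i for i, line in enumerate(lines) if line.rstrip().endswith("V2000")
    )
    n_atoms = int(lines[counts_idx][0:3])
    n_bonds = int(lines[counts_idx][3:6])
    first_atom = counts_idx + 1
    first_bond = first_atom + n_atoms
    atom_lines = lines[first_atom:first_bond]
    bond_lines = lines[first_bond : first_bond + n_bonds]
    # In an atom line, the 4th whitespace-separated field is the element.
    elements = Counter(line.split()[3] for line in atom_lines)
    # Each bond line starts with: atom index, atom index, bond order.
    bonds = [tuple(int(line[i : i + 3]) for i in (0, 3, 6)) for line in bond_lines]
    return elements, bonds


elements, bonds = parse_v2000(STRUCTURE_MOL)
print(dict(elements))
print(len(bonds), "bonds")
```

Hydrogen atoms are implicit in this block, so the element counts cover heavy atoms only.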


## Add and remove drugs (customise the drugs list)

Now you can modify the drug recogniser's behaviour if there is a particular drug which it isn't finding:

To reset the drugs dictionary:

```
from drug_named_entity_recognition.drugs_finder import reset_drugs_data
reset_drugs_data()
```

To add a synonym:

```
from drug_named_entity_recognition.drugs_finder import add_custom_drug_synonym
add_custom_drug_synonym("potato", "sertraline")
```

To add a new drug:

```
from drug_named_entity_recognition.drugs_finder import add_custom_new_drug
add_custom_new_drug("potato", {"name": "solanum tuberosum"})
```

To remove an existing drug:

```
from drug_named_entity_recognition.drugs_finder import remove_drug_synonym
remove_drug_synonym("sertraline")
```

# Interested in other kinds of named entity recognition (NER)? 💸Finances, 🎩company names, 🌎countries, 🗺️locations, proteins, 🧬genes, 🧪molecules?

If your NER problem is common across industries and likely to have been seen before, there may be an off-the-shelf NER tool for your purposes, such as our [Country Named Entity Recognition](http://fastdatascience.com//country-named-entity-recognition/) Python library. Dictionary-based named entity recognition is not always the solution, as sometimes the total set of entities is an open set and can't be listed (e.g. personal names), so sometimes a bespoke trained NER model is the answer. For tasks like finding email addresses or phone numbers, regular expressions (simple rules) are sufficient for the job.

If your named entity recognition or named entity linking problem is very niche and unusual, and a product exists for that problem, that product is likely to only solve your problem 80% of the way, and you will have more work trying to fix the final mile than if you had done the whole thing manually. Please [contact Fast Data Science](http://fastdatascience.com//contact) and we'll be glad to discuss. For example, we've worked on [a consultancy engagement to find molecule names in papers, and match author names to customers](http://fastdatascience.com//boehringer-ingelheim-finding-molecules-and-proteins-in-scientific-literature/) where the goal was to trace molecule samples ordered from a pharma company and identify when the samples resulted in a publication. For this case, there was no off-the-shelf library that we could use.

For a problem like identifying country names in English, which is a closed set with well-known variants and aliases, an off-the-shelf library is usually available. You may wish to try our [Country Named Entity Recognition](https://fastdatascience.com/country-named-entity-recognition/) library, also open-source and under MIT license.

Identifying a set of molecules manufactured by a particular company is a task better suited to a [consulting engagement](https://fastdatascience.com/portfolio/nlp-consultant/).

# 😊 Using this tool directly from Google Sheets (no-code!)

<img align="left" alt="Google Sheets logo" title="Google Sheets logo" width=150 height=105  src="google_sheets_logo_small.png" />

We have a no-code solution where you can [use the library directly from Google Sheets](https://fastdatascience.com/drug-name-recogniser) as the library has also been wrapped as a Google Sheets plugin.

[Click here](https://www.youtube.com/watch?v=qab1Bv_YpYU) to watch a video of how the plugin works.

You can install the plugin in Google Sheets [here](https://workspace.google.com/marketplace/app/drug_name_recogniser/463844408236).

![google_sheets_screenshot.png](google_sheets_screenshot.png)

# Requirements

Python 3.9 and above

## ✉️Who to contact?

You can contact Thomas Wood or the Fast Data Science team at https://fastdatascience.com/.

# 🤝Compatibility with other natural language processing libraries

The Drug Named Entity Recognition library is independent of other NLP tools and has no dependencies. The tool is lightweight and has no special system requirements. However, it combines well with other libraries such as [spaCy](https://spacy.io) or the [Natural Language Toolkit (NLTK)](https://www.nltk.org/api/nltk.tokenize.html).

## Using Drug Named Entity Recognition together with spaCy

Here is an example call to the tool with a [spaCy](https://spacy.io) Doc object:

```
from drug_named_entity_recognition import find_drugs
import spacy
nlp = spacy.blank("en")
doc = nlp("i routinely rx rimonabant and pts prefer it")
find_drugs([t.text for t in doc])
```

outputs:

```
[({'name': 'Rimonabant', 'synonyms': {'Acomplia', 'Rimonabant', 'Zimulti'}, 'mesh_id': 'D063387', 'drugbank_id': 'DB06155'}, 3, 3)]
```

## Using Drug Named Entity Recognition together with NLTK

You can also use the tool together with the [Natural Language Toolkit (NLTK)](https://www.nltk.org/api/nltk.tokenize.html):

```
from drug_named_entity_recognition import find_drugs
from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize("i routinely rx rimonabant and pts prefer it")
find_drugs(tokens)
```
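If you would rather not install a tokeniser, a regular expression can do a similar job. The sketch below uses the pattern `\w+|[^\w\s]+`, which splits text into runs of word characters and runs of punctuation, and also records character offsets so that a token position can be mapped back to the original string. The helper names are hypothetical, and the example assumes the reported positions are inclusive token indices.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenise_with_offsets(text):
    """Return (tokens, offsets), where offsets[i] is the (start, end) of token i."""
    tokens, offsets = [], []
    for match in TOKEN_PATTERN.finditer(text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


def token_span_to_char_span(offsets, start_token, end_token):
    """Map an inclusive token range to a (start, end) character slice."""
    return offsets[start_token][0], offsets[end_token][1]


text = "i routinely rx rimonabant, and pts prefer it."
tokens, offsets = tokenise_with_offsets(text)

# In the spaCy example above, the match was reported at token index 3.
start_char, end_char = token_span_to_char_span(offsets, 3, 3)
print(tokens)
print(text[start_char:end_char])
```

The `tokens` list can be passed to `find_drugs` in the same way as the spaCy and NLTK token lists, and the offsets let you highlight each match in the source text.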

# 📁Data sources

The main data source is DrugBank, augmented by datasets from the NHS, MeSH, Medline Plus and Wikipedia.

🌟 There is a handy Jupyter Notebook, `update.ipynb`, which will update the Drugbank and MeSH data sources (re-download them from the relevant third parties).

## Update the Drugbank dictionary

If you want to update the dictionary, you can use the data dump from Drugbank and replace the file `drugbank vocabulary.csv`:

* Download the open data dump from https://go.drugbank.com/releases/latest#open-data

## Update the Wikipedia dictionary

If you want to update the Wikipedia dictionary, download the dump from:

* https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

and run `extract_drug_names_and_synonyms_from_wikipedia_dump.py`

## Update the MeSH dictionary

If you want to update the dictionary, run

```
python download_mesh_dump_and_extract_drug_names_and_synonyms.py
```

This will download the latest XML file from NIH.

If the link doesn't work, download the open data dump manually from https://www.nlm.nih.gov/. It should be called something like `desc2023.xml`. Then comment out the Wget/Curl commands in the script before running it.
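For orientation, here is a rough sketch of how names and synonyms can be read from a MeSH descriptor file with the standard library. It is not the project's extraction script. The inline `SAMPLE_XML` is a hand-written, heavily trimmed record, and the element names (`DescriptorRecord`, `DescriptorUI`, `DescriptorName`, `Term`, `String`) are assumed to match the layout of the descriptor file you download, so check them against your copy.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the assumed MeSH descriptor layout.
SAMPLE_XML = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000001</DescriptorUI>
    <DescriptorName><String>Calcimycin</String></DescriptorName>
    <ConceptList>
      <Concept>
        <TermList>
          <Term><String>Calcimycin</String></Term>
          <Term><String>A-23187</String></Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def extract_names(xml_text):
    """Map each descriptor ID to its preferred name and set of entry terms."""
    root = ET.fromstring(xml_text)
    names = {}
    for record in root.iter("DescriptorRecord"):
        mesh_id = record.findtext("DescriptorUI")
        preferred = record.findtext("DescriptorName/String")
        # Every Term anywhere under the record contributes one synonym.
        synonyms = {term.text for term in record.iterfind(".//Term/String")}
        names[mesh_id] = {"name": preferred, "synonyms": synonyms}
    return names


names = extract_names(SAMPLE_XML)
print(names)
```

The full descriptor file is large, so for real use `xml.etree.ElementTree.iterparse` is a better fit than loading the whole document at once.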

## License information for external data sources

* Data from Drugbank is licensed under [CC0](https://go.drugbank.com/releases/latest#open-data).

```
To the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.
```

* Text from Wikipedia data dump is licensed under [GNU Free Documentation License](https://www.gnu.org/licenses/fdl-1.3.html) and [Creative Commons Attribution-Share-Alike 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/). [More information](https://dumps.wikimedia.org/legal.html).

## Contributing to the Drug Named Entity Recognition library

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our [GitHub repository](https://github.com/fastdatascience/drug_named_entity_recognition). You can also [raise an issue](https://github.com/fastdatascience/drug_named_entity_recognition/issues).

## Developing the Drug Named Entity Recognition library

### Automated tests

Test code is in the **tests/** folder and uses [unittest](https://docs.python.org/3/library/unittest.html).

The testing tool `tox` is used in the automation with GitHub Actions CI/CD.

### Use tox locally

Install tox and run it:

```
pip install tox
tox
```

In our configuration, tox checks the source distribution using [check-manifest](https://pypi.org/project/check-manifest/), runs setuptools's check, and runs the unit tests using pytest. check-manifest requires your repo to be at least git-initialized (`git init`) and added (`git add .`). You don't need to install check-manifest or pytest yourself, as tox installs them in a separate environment.

The automated tests are run against several Python versions, but on your machine you might have only one. If that is Python 3.9, run:

```
tox -e py39
```

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally. If you do want to build and upload them yourself, see "Re-releasing the package manually" below.

### 🤖 Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

- uses GitHub Actions for both testing and publishing
- is tested when pushing to the `master` or `main` branch, and is published when a release is created
- includes test files in the source distribution
- uses **setup.cfg** for [version single-sourcing](https://packaging.python.org/guides/single-sourcing-package-version/) (setuptools 46.4.0+)

## 🧍Re-releasing the package manually

The code to re-release Drug Named Entity Recognition on PyPI is as follows:

```
source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*
```

## 😊 Who worked on the Drug Named Entity Recognition library?

The tool was developed by:

* Thomas Wood ([Fast Data Science](https://fastdatascience.com))

## 📜License of Drug Named Entity Recognition library

MIT License. Copyright (c) 2023 [Fast Data Science](https://fastdatascience.com)

## ✍️ Citing the Drug Named Entity Recognition library

Wood, T.A., Drug Named Entity Recognition [Computer software], Version 2.0.4, accessed at [https://fastdatascience.com/drug-named-entity-recognition-python-library](https://fastdatascience.com/drug-named-entity-recognition-python-library), Fast Data Science Ltd (2024)

```
@unpublished{drugnamedentityrecognition,
    AUTHOR = {Wood, T.A.},
    TITLE  = {Drug Named Entity Recognition (Computer software), Version 2.0.4},
    YEAR   = {2024},
    Note   = {To appear},
    url = {https://zenodo.org/doi/10.5281/zenodo.10970631},
    doi = {10.5281/zenodo.10970631}
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://fastdatascience.com/drug-named-entity-recognition-python-library",
    "name": "drug-named-entity-recognition",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "Thomas Wood <thomas@fastdatascience.com>",
    "keywords": "drug, bio, biomedical, medical, pharma, pharmaceutical, ner, nlp, named entity recognition, natural language processing, named entity linking",
    "author": "Thomas Wood",
    "author_email": "Thomas Wood <thomas@fastdatascience.com>",
    "download_url": "https://files.pythonhosted.org/packages/34/4b/675d4e9100ef3192d543eae51f90a53bbcea04143e4cdf7736d08029e584/drug_named_entity_recognition-2.0.4.tar.gz",
    "platform": null,
    "description": "![Fast Data Science logo](https://raw.githubusercontent.com/fastdatascience/brand/main/primary_logo.svg)\n\n<a href=\"https://fastdatascience.com\"><span align=\"left\">\ud83c\udf10 fastdatascience.com</span></a>\n<a href=\"https://www.linkedin.com/company/fastdatascience/\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/linkedin.svg\" alt=\"Fast Data Science | LinkedIn\" width=\"21px\"/></a>\n<a href=\"https://twitter.com/fastdatascienc1\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/x.svg\" alt=\"Fast Data Science | X\" width=\"21px\"/></a>\n<a href=\"https://www.instagram.com/fastdatascience/\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/instagram.svg\" alt=\"Fast Data Science | Instagram\" width=\"21px\"/></a>\n<a href=\"https://www.facebook.com/fastdatascienceltd\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/fb.svg\" alt=\"Fast Data Science | Facebook\" width=\"21px\"/></a>\n<a href=\"https://www.youtube.com/channel/UCLPrDH7SoRT55F6i50xMg5g\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/yt.svg\" alt=\"Fast Data Science | YouTube\" width=\"21px\"/></a>\n<a href=\"https://g.page/fast-data-science\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/google.svg\" alt=\"Fast Data Science | Google\" width=\"21px\"/></a>\n<a href=\"https://medium.com/fast-data-science\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/medium.svg\" alt=\"Fast Data Science | Medium\" width=\"21px\"/></a>\n<a href=\"https://mastodon.social/@fastdatascience\"><img align=\"left\" src=\"https://raw.githubusercontent.com//harmonydata/.github/main/profile/mastodon.svg\" alt=\"Fast Data Science | Mastodon\" 
width=\"21px\"/></a>\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10970631.svg)](https://doi.org/10.5281/zenodo.10970631)\n\n\nYou can run the walkthrough Python notebook in [Google Colab](https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb) with a single click: <a href=\"https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n\n# Drug Named Entity Recognition Python library by Fast Data Science\n\n## Finds drug names and can look up molecular structure\n\n<!-- badges: start -->\n![my badge](https://badgen.net/badge/Status/In%20Development/orange)\n[![PyPI package](https://img.shields.io/badge/pip%20install-drug_named_entity_recognition-brightgreen)](https://pypi.org/project/drug-named-entity-recognition/) [![version number](https://img.shields.io/pypi/v/drug-named-entity-recognition?color=green&label=version)](https://github.com/fastdatascience/drug_named_entity_recognition/releases) [![License](https://img.shields.io/github/license/fastdatascience/drug_named_entity_recognition)](https://github.com/fastdatascience/drug_named_entity_recognition/blob/main/LICENSE)\n[![pypi Version](https://img.shields.io/pypi/v/drug_named_entity_recognition.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/drug_named_entity_recognition/)\n [![version number](https://img.shields.io/pypi/v/drug_named_entity_recognition?color=green&label=version)](https://github.com/fastdatascience/drug_named_entity_recognition/releases) [![PyPi 
downloads](https://static.pepy.tech/personalized-badge/drug_named_entity_recognition?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/drug_named_entity_recognition/)\n[![forks](https://img.shields.io/github/forks/fastdatascience/drug_named_entity_recognition)](https://github.com/fastdatascience/drug_named_entity_recognition/forks)\n\n<!-- badges: end -->\n\n# \ud83d\udc8a Drug Named Entity Recognition\n\nDeveloped by Fast Data Science, https://fastdatascience.com\n\nSource code at https://github.com/fastdatascience/drug_named_entity_recognition\n\nTutorial at https://fastdatascience.com/drug-named-entity-recognition-python-library/\n\nThis is a lightweight Python [natural language processing](https://naturallanguageprocessing.com/applications-nlp-use-cases-business-solutions/) library for finding drug names in a string, otherwise known as [named entity recognition (NER)](https://fastdatascience.com/named-entity-recognition/) and named entity linking. This can be used for analysing unstructured text documents such as [clinical trial protocols](https://clinicaltrialrisk.org/clinical-trial-protocol-software/) or other [unstructured data in biopharma](https://harmonydata.ac.uk/data-harmonisation/data-harmonisation-in-biopharma/). \n\nPlease note this library finds only high confidence drugs.\n\n**New in version 2.0.0: we can support misspellings by using fuzzy matching!**\n\nIt also only finds the English names of these drugs. 
Names [in other languages](https://fastdatascience.com/natural-language-processing/multilingual-natural-language-processing/) are not supported.\n\nIt also doesn't find short code names of drugs, such as abbreviations commonly used in medicine, such as \"Ceph\" for \"Cephradin\" - as these are highly ambiguous.\n\n# \ud83d\udcbbInstalling Drug Named Entity Recognition Python package\n\nYou can install Drug Named Entity Recognition from [PyPI](https://pypi.org/project/drug-named-entity-recognition).\n\n```\npip install drug-named-entity-recognition\n```\n\nIf you get an error installing Drug Named Entity Recognition, try making a new Python environment in Conda (`conda create -n test-env; conda activate test-env`) or Venv (`python -m testenv; source testenv/bin/activate` / `testenv\\Scripts\\activate`) and then installing the library.\n\nThe library already contains the drug names so if you don't need to update the dictionary, then you should not have to run any of the download scripts.\n\nIf you have problems installing, try our [Google Colab](https://colab.research.google.com/github/fastdatascience/drug_named_entity_recognition/blob/main/drug_named_entity_recognition_example_walkthrough.ipynb) walkthrough.\n\n# \ud83d\udca1Usage examples\n\nYou must first tokenise your input text using a tokeniser of your choice (NLTK, spaCy, etc).\n\nYou pass a list of strings to the `find_drugs` function.\n\nExample 1\n\n```\nfrom drug_named_entity_recognition import find_drugs\n\nfind_drugs(\"i bought some Prednisone\".split(\" \"))\n```\n\noutputs a list of tuples.\n\n```\n[({'name': 'Prednisone', 'synonyms': {'Sone', 'Sterapred', 'Deltasone', 'Panafcort', 'Prednidib', 'Cortan', 'Rectodelt', 'Prednisone', 'Cutason', 'Meticorten', 'Panasol', 'Enkortolon', 'Ultracorten', 'Decortin', 'Orasone', 'Winpred', 'Dehydrocortisone', 'Dacortin', 'Cortancyl', 'Encorton', 'Encortone', 'Decortisyl', 'Kortancyl', 'Pronisone', 'Prednisona', 'Predniment', 'Prednisonum', 'Rayos'}, 
'medline_plus_id': 'a601102', 'mesh_id': 'D018931', 'drugbank_id': 'DB00635'}, 3, 3)]\n```\n\n## Fuzzy matching (misspellings)\n\nYou can turn on fuzzy matching (ngrams) with `is_fuzzy_match`\n\n```\nfind_drugs([\"paraxcetamol\"], is_fuzzy_match=True)\n```\n\n## Metadata such as molecular structure.\n\nYou can retrieve molecular structure in `.mol` format with:\n\n```\nfrom drug_named_entity_recognition.drugs_finder import find_drugs\ndrugs = find_drugs(\"i bought some paracetamol\".split(\" \"), is_include_structure=True)\n```\n\nthis will return the atomic structure of the drug if that data is available.\n\n```\n>>> print (drugs[0][0][\"structure_mol\"])\n316\n  Mrv0541 02231214352D          \n\n 11 11  0  0  0  0            999 V2000\n    2.3645   -2.1409    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n    3.7934    1.1591    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3645    1.1591    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3645    0.3341    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.0790   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.6500   -0.0784    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.0790   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.6500   -0.9034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.3645   -1.3159    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.0790    1.5716    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    3.0790    2.3966    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  1  9  1  0  0  0  0\n  2 10  2  0  0  0  0\n  3  4  1  0  0  0  0\n  3 10  1  0  0  0  0\n  4  5  2  0  0  0  0\n  4  6  1  0  0  0  0\n  5  7  1  0  0  0  0\n  6  8  2  0  0  0  0\n  7  9  2  0  0  0  0\n  8  9  1  0  0  0  0\n 10 11  1  0  0  0  0\nM  END\nDB00316\n\n```\n\n\n## Add and remove drugs (customise the drugs list)\n\nNow you can modify the drug recogniser's behaviour if there is a particular drug which it isn't finding:\n\nTo reset the drugs 
dictionary\n\n```\nfrom drug_named_entity_recognition.drugs_finder import reset_drugs_data\nreset_drugs_data()\n```\n\nTo add a synonym\n\n```\nfrom drug_named_entity_recognition.drugs_finder import add_custom_drug_synonym\nadd_custom_drug_synonym(\"potato\", \"sertraline\")\n```\n\nTo add a new drug\n\n```\nfrom drug_named_entity_recognition.drugs_finder import add_custom_new_drug\nadd_custom_new_drug(\"potato\", {\"name\": \"solanum tuberosum\"})\n```\n\nTo remove an existing drug\n\n```\nfrom drug_named_entity_recognition.drugs_finder import remove_drug_synonym\nremove_drug_synonym(\"sertraline\")\n```\n\n# Interested in other kinds of named entity recognition (NER)? \ud83d\udcb8Finances, \ud83c\udfa9company names, \ud83c\udf0ecountries, \ud83d\uddfa\ufe0flocations, proteins, \ud83e\uddecgenes, \ud83e\uddeamolecules?\n\nIf your NER problem is common across industries and likely to have been seen before, there may be an off-the-shelf NER tool for your purposes, such as our [Country Named Entity Recognition](http://fastdatascience.com//country-named-entity-recognition/) Python library. Dictionary-based named entity recognition is not always the solution, as sometimes the total set of entities is an open set and can't be listed (e.g. personal names), so sometimes a bespoke trained NER model is the answer. For tasks like finding email addresses or phone numbers, regular expressions (simple rules) are sufficient for the job.\n\nIf your named entity recognition or named entity linking problem is very niche and unusual, and a product exists for that problem, that product is likely to only solve your problem 80% of the way, and you will have more work trying to fix the final mile than if you had done the whole thing manually. Please [contact Fast Data Science](http://fastdatascience.com//contact) and we'll be glad to discuss. 
For example, we've worked on [a consultancy engagement to find molecule names in papers, and match author names to customers](http://fastdatascience.com//boehringer-ingelheim-finding-molecules-and-proteins-in-scientific-literature/) where the goal was to trace molecule samples ordered from a pharma company and identify when the samples resulted in a publication. For this case, there was no off-the-shelf library that we could use.\n\nFor a problem like identifying country names in English, which is a closed set with well-known variants and aliases, an off-the-shelf library is usually available. You may wish to try our [Country Named Entity Recognition](https://fastdatascience.com/country-named-entity-recognition/) library, also open-source and under MIT license.\n\nFor identifying a set of molecules manufactured by a particular company, this is the kind of task more suited to a [consulting engagement](https://fastdatascience.com/portfolio/nlp-consultant/).\n\n# \ud83d\ude0a Using this tool directly from Google Sheets (no-code!)\n\n<img align=\"left\" alt=\"Google Sheets logo\" title=\"Google Sheets logo\" width=150 height=105  src=\"google_sheets_logo_small.png\" />\n\nWe have a no-code solution where you can [use the library directly from Google Sheets](https://fastdatascience.com/drug-name-recogniser) as the library has also been wrapped as a Google Sheets plugin.\n\n[Click here](https://www.youtube.com/watch?v=qab1Bv_YpYU) to watch a video of how the plugin works.\n\nYou can install the plugin in Google Sheets [here](https://workspace.google.com/marketplace/app/drug_name_recogniser/463844408236).\n\n![google_sheets_screenshot.png](google_sheets_screenshot.png)\n\n# Requirements\n\nPython 3.9 and above\n\n## \u2709\ufe0fWho to contact?\n\nYou can contact Thomas Wood or the Fast Data Science team at https://fastdatascience.com/.\n\n# \ud83e\udd1dCompatibility with other natural language processing libraries\n\nThe Drug Named Entity Recognition library is 
independent of other NLP tools and has no dependencies. It is lightweight and has no special system requirements. However, it combines well with other libraries such as [spaCy](https://spacy.io) or the [Natural Language Toolkit (NLTK)](https://www.nltk.org/api/nltk.tokenize.html).\n\n## Using Drug Named Entity Recognition together with spaCy\n\nHere is an example call to the tool with a [spaCy](https://spacy.io) Doc object:\n\n```\nfrom drug_named_entity_recognition import find_drugs\nimport spacy\nnlp = spacy.blank(\"en\")\ndoc = nlp(\"i routinely rx rimonabant and pts prefer it\")\nfind_drugs([t.text for t in doc])\n```\n\noutputs:\n\n```\n[({'name': 'Rimonabant', 'synonyms': {'Acomplia', 'Rimonabant', 'Zimulti'}, 'mesh_id': 'D063387', 'drugbank_id': 'DB06155'}, 3, 3)]\n```\n\n## Using Drug Named Entity Recognition together with NLTK\n\nYou can also use the tool together with the [Natural Language Toolkit (NLTK)](https://www.nltk.org/api/nltk.tokenize.html):\n\n```\nfrom drug_named_entity_recognition import find_drugs\nfrom nltk.tokenize import wordpunct_tokenize\ntokens = wordpunct_tokenize(\"i routinely rx rimonabant and pts prefer it\")\nfind_drugs(tokens)\n```\n\n# \ud83d\udcc1Data sources\n\nThe main data source is Drugbank, augmented by datasets from the NHS, MeSH, Medline Plus and Wikipedia.\n\n\ud83c\udf1f There is a handy Jupyter Notebook, `update.ipynb`, which will update the Drugbank and MeSH data sources (re-download them from the relevant third parties). 
\n\n## Update the Drugbank dictionary\n\nIf you want to update the dictionary, you can use the data dump from Drugbank and replace the file `drugbank vocabulary.csv`:\n\n* Download the open data dump from https://go.drugbank.com/releases/latest#open-data\n\n## Update the Wikipedia dictionary\n\nIf you want to update the Wikipedia dictionary, download the dump from:\n\n* https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia\n\nand run `extract_drug_names_and_synonyms_from_wikipedia_dump.py`\n\n## Update the MeSH dictionary\n\nIf you want to update the MeSH dictionary, run\n\n```\npython download_mesh_dump_and_extract_drug_names_and_synonyms.py\n```\n\nThis will download the latest XML file from NIH.\n\nIf the link doesn't work, download the open data dump manually from https://www.nlm.nih.gov/. It should be called something like `desc2023.xml`. Then comment out the Wget/Curl commands in the code before running the script.\n\n## License information for external data sources\n\n* Data from Drugbank is licensed under [CC0](https://go.drugbank.com/releases/latest#open-data).\n\n```\nTo the extent possible under law, the person who associated CC0 with the DrugBank Open Data has waived all copyright and related or neighboring rights to the DrugBank Open Data. This work is published from: Canada.\n```\n\n* Text from the Wikipedia data dump is licensed under the [GNU Free Documentation License](https://www.gnu.org/licenses/fdl-1.3.html) and the [Creative Commons Attribution-Share-Alike 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/). [More information](https://dumps.wikimedia.org/legal.html).\n\n## Contributing to the Drug Named Entity Recognition library\n\nIf you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our [GitHub repository](https://github.com/fastdatascience/drug_named_entity_recognition). You can also [raise an issue](https://github.com/fastdatascience/drug_named_entity_recognition/issues). 
\n\n## Developing the Drug Named Entity Recognition library\n\n### Automated tests\n\nTest code is in the **tests/** folder and uses [unittest](https://docs.python.org/3/library/unittest.html).\n\nThe testing tool `tox` is used in the automation with GitHub Actions CI/CD.\n\n### Use tox locally\n\nInstall tox and run it:\n\n```\npip install tox\ntox\n```\n\nIn our configuration, tox runs three checks: a check of the source distribution using [check-manifest](https://pypi.org/project/check-manifest/), setuptools's check, and unit tests using pytest. check-manifest requires your repo to be at least git-initialized (`git init`) with its files added (`git add .`). You don't need to install check-manifest and pytest yourself, as tox will install them in a separate environment.\n\nThe automated tests are run against several Python versions, but on your machine you might be using only one version of Python. If that is Python 3.9, then run:\n\n```\ntox -e py39\n```\n\nThanks to GitHub Actions' automated process, you don't need to generate distribution files locally. 
If you do want to build them yourself, see the \"Re-releasing the package manually\" section below.\n\n### \ud83e\udd16 Continuous integration/deployment to PyPI\n\nThis package is based on the template https://pypi.org/project/example-pypi-package/\n\nThis package\n\n- uses GitHub Actions for both testing and publishing\n- is tested when you push to the `master` or `main` branch, and is published when you create a release\n- includes test files in the source distribution\n- uses **setup.cfg** for [version single-sourcing](https://packaging.python.org/guides/single-sourcing-package-version/) (setuptools 46.4.0+)\n\n## \ud83e\uddcdRe-releasing the package manually\n\nThe commands to re-release Drug Named Entity Recognition on PyPI are as follows:\n\n```\nsource activate py311\npip install twine\nrm -rf dist\npython setup.py sdist\ntwine upload dist/*\n```\n\n## \ud83d\ude0a Who worked on the Drug Named Entity Recognition library?\n\nThe tool was developed by:\n\n* Thomas Wood ([Fast Data Science](https://fastdatascience.com))\n\n## \ud83d\udcdcLicense of Drug Named Entity Recognition library\n\nMIT License. Copyright (c) 2023 [Fast Data Science](https://fastdatascience.com)\n\n## \u270d\ufe0f Citing the Drug Named Entity Recognition library\n\nWood, T.A., Drug Named Entity Recognition [Computer software], Version 2.0.4, accessed at [https://fastdatascience.com/drug-named-entity-recognition-python-library](https://fastdatascience.com/drug-named-entity-recognition-python-library), Fast Data Science Ltd (2024)\n\n```\n@unpublished{drugnamedentityrecognition,\n    AUTHOR = {Wood, T.A.},\n    TITLE  = {Drug Named Entity Recognition (Computer software), Version 2.0.4},\n    YEAR   = {2024},\n    Note   = {To appear},\n    url = {https://zenodo.org/doi/10.5281/zenodo.10970631},\n    doi = {10.5281/zenodo.10970631}\n}\n```\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Fast Data Science Ltd (https://fastdatascience.com). Maintainer: Thomas Wood. Tutorial at https://fastdatascience.com/drug-named-entity-recognition-python-library/  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Drug Named Entity Recognition library to find and resolve drug names in a string (drug named entity linking)",
    "version": "2.0.4",
    "project_urls": {
        "Bug Reports": "https://github.com/fastdatascience/drug_named_entity_recognition/issues",
        "Documentation": "https://fastdatascience.com/drug-named-entity-recognition-python-library/",
        "Homepage": "https://fastdatascience.com/drug-named-entity-recognition-python-library",
        "Source Code": "https://github.com/fastdatascience/drug_named_entity_recognition"
    },
    "split_keywords": [
        "drug",
        " bio",
        " biomedical",
        " medical",
        " pharma",
        " pharmaceutical",
        " ner",
        " nlp",
        " named entity recognition",
        " natural language processing",
        " named entity linking"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5169ce21d74d1d5ab044c84f84827334016e1913683cf9ea2fa049192427a053",
                "md5": "2f781119591fb79fa5a6c28165391011",
                "sha256": "fb448a3a1916c3fbe88432e1d789f2df5a30cda83dae050da810019a0eeaf476"
            },
            "downloads": -1,
            "filename": "drug_named_entity_recognition-2.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2f781119591fb79fa5a6c28165391011",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 1244818,
            "upload_time": "2024-10-14T13:34:18",
            "upload_time_iso_8601": "2024-10-14T13:34:18.045304Z",
            "url": "https://files.pythonhosted.org/packages/51/69/ce21d74d1d5ab044c84f84827334016e1913683cf9ea2fa049192427a053/drug_named_entity_recognition-2.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "344b675d4e9100ef3192d543eae51f90a53bbcea04143e4cdf7736d08029e584",
                "md5": "9b73dddb7054b3c3824fcda9cc1da912",
                "sha256": "ce5d3653625063899821d8a372abd833a97186bf0517a7bdcd6ffbe26ff49777"
            },
            "downloads": -1,
            "filename": "drug_named_entity_recognition-2.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "9b73dddb7054b3c3824fcda9cc1da912",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 1388285,
            "upload_time": "2024-10-14T13:34:19",
            "upload_time_iso_8601": "2024-10-14T13:34:19.783983Z",
            "url": "https://files.pythonhosted.org/packages/34/4b/675d4e9100ef3192d543eae51f90a53bbcea04143e4cdf7736d08029e584/drug_named_entity_recognition-2.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-14 13:34:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "fastdatascience",
    "github_project": "drug_named_entity_recognition",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "drug-named-entity-recognition"
}
        