scispacy

Name	scispacy
Version	0.5.5
home_page	https://allenai.github.io/scispacy/
Summary	A full SpaCy pipeline and models for scientific/biomedical documents.
upload_time	2024-10-27 05:43:37
maintainer	None
docs_url	None
author	Allen Institute for Artificial Intelligence
requires_python	>=3.6.0
license	Apache
keywords	bioinformatics nlp spacy spacy biomedical

            <p align="center"><img width="50%" src="docs/scispacy-logo.png" /></p>


This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's
rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data and
an entity span detection model. Separately, there are also NER models for more specific tasks.

**Just looking to test out the models on your data? Check out our [demo](https://scispacy.apps.allenai.org)** (Note: this demo is running an older version of scispaCy and may produce different results than the latest version).


## Installation
Installing scispacy requires two steps: installing the library and installing the models. To install the library, run:
```bash
pip install scispacy
```

To install a model (see our full selection of available models below), run a command like the following:

```bash
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
```

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy.
Take a look below in the "Setting up a virtual environment" section if you need some help with this.
Additionally, scispacy uses modern features of Python and as such is only available for **Python 3.6 or greater**.


### Installation note: nmslib
Over the years, installing nmslib has become quite difficult. There are a number of GitHub issues on scispaCy and the nmslib repo itself about this. This matrix is an attempt to help users install nmslib in whatever environment they have. I don't have access to every type of environment, so if you are able to test something out, please open an issue or pull request!

|               | Windows 11 | Windows Subsystem for Linux | Mac M1  | Mac M2  | Mac M3  | Intel Mac |
|---------------|------------|----------------------------|---------|---------|---------|-----------|
| Python 3.8    | ✅         | ✅                         | 💻      | ❓      | ❓      | ❓         |
| Python 3.9    | ❌🐍       | ✅                         | 💻      | ❓      | ❓      | ❓         |
| Python 3.10   | ❌🐍       | ✅                         | ❓      | ❓      | ❓      | ✅         |
| Python 3.11   | ❌🐍       | ❌🐍                       | ❓      | ❓      | ❓      | ❌         |
| Python 3.12   | ❌🐍       | ❌🐍🧠                     | ❓      | ❓      | ❓      | ❓         |


✅ = works normally with pip install of scispacy

❌ = does not work normally with pip install of scispacy

🐍 = can be installed with `mamba install nmslib`

💻 = can be installed with `CFLAGS="-mavx -DWARN(a)=(a)" pip install nmslib`

🧠 = can be installed with `pip install nmslib-metabrainz`

❓ = unconfirmed

Other methods mentioned in GitHub issues, but unconfirmed what versions they work for:
- `CFLAGS="-mavx -DWARN(a)=(a)" pip install nmslib`
- `pip install --no-binary :all: nmslib`
- `pip install "nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings"`
- `pip install --upgrade pybind11` + `pip install --verbose 'nmslib @ git+https://github.com/nmslib/nmslib.git#egg=nmslib&subdirectory=python_bindings'`

#### Setting up a virtual environment

[Mamba](https://mamba.readthedocs.io/en/latest/) can be used to set up a virtual environment with the
version of Python required for scispaCy.  If you already have a Python
environment you want to use, you can skip to the installation steps above.

1.  [Follow the installation instructions for Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html).

2.  Create a Conda environment called "scispacy" with Python 3.10 (any version >= 3.6 should work):

    ```bash
    mamba create -n scispacy python=3.10
    ```

3.  Activate the Mamba environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.

    ```bash
    mamba activate scispacy
    ```

Now you can install `scispacy` and one of the models using the steps above.


Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:
```python
import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")
```

#### Note on upgrading
If you are upgrading `scispacy`, you will need to download the models again, to get the model versions compatible with the version of `scispacy` that you have. The link to the model that you download should contain the version number of `scispacy` that you have.
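
The model links in this document all follow the same URL pattern, so a matching link can be derived from a model name and a version number. The helper below is a minimal sketch under that assumption: `model_url` is a hypothetical name and not part of scispacy, and a model release is not guaranteed to exist for every library version, so check that the link resolves before relying on it.

```python
# Hypothetical helper: builds a model download URL from the pattern used by
# the links in this README. It does not check that the release exists.
BASE_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def model_url(model: str, version: str) -> str:
    """Return the download URL for `model` at the given release `version`."""
    return f"{BASE_URL}/v{version}/{model}-{version}.tar.gz"


# The resulting string can be passed to `pip install`.
print(model_url("en_core_sci_sm", "0.5.4"))
```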

## Available Models

To install a model, click on the link below to download the model, and then run

```bash
pip install </path/to/download>
```

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address", and running
```bash
pip install <paste the copied URL>
```

| Model          | Description       | Install URL |
|:---------------|:------------------|:----------|
| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz)|
| en_core_sci_md |  A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz)|
| en_core_sci_lg |  A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz)|
| en_core_sci_scibert |  A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_scibert-0.5.4.tar.gz)|
| en_ner_craft_md|  A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_craft_md-0.5.4.tar.gz)|
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_jnlpba_md-0.5.4.tar.gz)|
| en_ner_bc5cdr_md |  A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz)|
| en_ner_bionlp13cg_md |  A spaCy NER model trained on the BIONLP13CG corpus. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz)|


## Additional Pipeline Components


### AbbreviationDetector
The AbbreviationDetector is a spaCy component which implements the abbreviation detection algorithm in "A simple algorithm
for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the `doc._.abbreviations` attribute, and for a given abbreviation,
you can access its long form (which is a `spacy.tokens.Span`) using `span._.long_form`, which will point to
another span in the document.
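
To give a feel for the matching idea behind the Schwartz & Hearst algorithm, here is a simplified, standard-library-only sketch. It is not scispacy's implementation: the function names are hypothetical, it works on raw strings rather than spaCy spans, and it only handles the "long form (SHORT)" pattern. The short form is matched character by character from right to left against the words preceding the parenthesis, and its first character must start a word.

```python
import re
from typing import List, Optional, Tuple


def find_best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Match short_form against candidate from right to left (simplified sketch)."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def find_abbreviations(text: str) -> List[Tuple[str, str]]:
    """Return (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        short_form = match.group(1).strip()
        words = text[: match.start()].split()
        # Only a limited window of words before the parenthesis is searched.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


text = (
    "Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron "
    "disease caused by the expansion of a polyglutamine tract within the "
    "androgen receptor (AR)."
)
print(find_abbreviations(text))
```

The real component additionally records every later occurrence of a detected short form (such as the second "SBMA" in the example sentence), which this sketch does not attempt.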


#### Example Usage
```python
import spacy

# Importing AbbreviationDetector registers the "abbreviation_detector" factory with spaCy.
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

# Example output:
# Abbreviation   Span       Definition
# SBMA           (33, 34)   Spinal and bulbar muscular atrophy
# SBMA           (6, 7)     Spinal and bulbar muscular atrophy
# AR             (29, 30)   androgen receptor
```

> **Note**
> If you want to be able to [serialize your `doc` objects](https://spacy.io/usage/saving-loading), load the abbreviation detector with `make_serializable=True`, e.g. `nlp.add_pipe("abbreviation_detector", config={"make_serializable": True})`

### EntityLinker

The `EntityLinker` is a spaCy component which performs linking to a knowledge base. The linker simply performs
a string-overlap-based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base
using an approximate nearest neighbours search.
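
The following standard-library sketch illustrates what a char-3gram overlap search looks like. It is a simplification under stated assumptions, not the linker's actual code: it scores candidates with plain cosine similarity over 3-gram counts and searches exhaustively rather than with an approximate nearest neighbours index. The names `char_ngrams`, `cosine`, `nearest_concepts`, and `TOY_KB`, as well as the space padding, are all hypothetical; the concept IDs and aliases are taken from the example output later in this section.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i : i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def nearest_concepts(
    mention: str, kb: Dict[str, List[str]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k concepts whose best-matching alias is most similar to the mention."""
    query = char_ngrams(mention)
    scored = [
        (concept_id, max(cosine(query, char_ngrams(alias)) for alias in aliases))
        for concept_id, aliases in kb.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy knowledge base: concept ID -> aliases.
TOY_KB = {
    "C1839259": ["Bulbo-Spinal Atrophy, X-Linked"],
    "C0541794": ["Skeletal muscle atrophy", "ATROPHY SKELETAL MUSCLE"],
    "C1447749": ["AR protein, human", "Androgen Receptor", "Dihydrotestosterone Receptor"],
}

for concept_id, score in nearest_concepts("androgen receptor", TOY_KB):
    print(concept_id, round(score, 3))
```

Exhaustive search is fine for three concepts but would be far too slow for a knowledge base with millions of entries, which is why an approximate index is used in practice.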

Currently (v2.5.0), there are 5 supported linkers:

- `umls`: Links to the [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html), levels 0, 1, 2 and 9. This has ~3M concepts.
- `mesh`: Links to the [Medical Subject Headings](https://www.nlm.nih.gov/mesh/meshhome.html). This contains a smaller set of higher quality entities, which are used for indexing in PubMed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
- `rxnorm`: Links to the [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/index.html) ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It comprises several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
- `go`: Links to the [Gene Ontology](http://geneontology.org/). The Gene Ontology contains ~67k concepts focused on the functions of genes.
- `hpo`: Links to the [Human Phenotype Ontology](https://hpo.jax.org/app/). The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to adjust some of the parameters
below to adapt to your use case (higher precision, higher recall, etc.).

- `resolve_abbreviations : bool, optional (default = False)`
    Whether to resolve abbreviations identified in the Doc before performing linking.
    This parameter has no effect if there is no `AbbreviationDetector` in the spacy
    pipeline.
- `k : int, optional, (default = 30)`
    The number of nearest neighbours to look up from the candidate generator per mention.
- `threshold : float, optional, (default = 0.7)`
    The threshold that a mention candidate must reach to be added to the mention in the Doc
    as a mention candidate.
- `no_definition_threshold : float, optional, (default = 0.95)`
    The threshold that an entity candidate must reach to be added to the mention in the Doc
    as a mention candidate if the entity candidate does not have a definition.
- `filter_for_definitions: bool, default = True`
    Whether to filter entities that can be returned to only include those with definitions
    in the knowledge base.
- `max_entities_per_mention : int, optional, default = 5`
    The maximum number of entities which will be returned for a given mention, regardless of
    how many nearest neighbours are found.

This class sets the `._.kb_ents` attribute on spaCy Spans, which consists of a
`List[Tuple[str, float]]` corresponding to the KB concept_id and the associated score
for a list of `max_entities_per_mention` number of entities.
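
As a rough illustration of how the thresholds above could interact, the sketch below filters a list of scored candidates into the kind of `(concept_id, score)` list stored in `._.kb_ents`. It is written from the parameter descriptions only and is not the library's code: `select_kb_ents` is a hypothetical name, the concept IDs are made up, and whether each comparison is strict or inclusive is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def select_kb_ents(
    candidates: List[Tuple[str, float]],
    definitions: Dict[str, Optional[str]],
    threshold: float = 0.7,
    no_definition_threshold: float = 0.95,
    filter_for_definitions: bool = True,
    max_entities_per_mention: int = 5,
) -> List[Tuple[str, float]]:
    """Keep the best-scoring candidates that pass the thresholds."""
    kept = []
    for concept_id, score in candidates:
        has_definition = definitions.get(concept_id) is not None
        # Candidates without a definition must clear the stricter threshold.
        if filter_for_definitions and not has_definition and score < no_definition_threshold:
            continue
        if score > threshold:
            kept.append((concept_id, score))
    # Highest score first, then cap the number of entities per mention.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_entities_per_mention]


# Made-up candidates and definitions for illustration only.
candidates = [("CUI_A", 0.98), ("CUI_B", 0.80), ("CUI_C", 0.96), ("CUI_D", 0.60)]
definitions = {"CUI_A": "has a definition", "CUI_B": None, "CUI_C": None, "CUI_D": "has a definition"}

print(select_kb_ents(candidates, definitions))
```

Here `CUI_B` is dropped because it has no definition and scores below `no_definition_threshold`, while `CUI_D` is dropped because it does not exceed `threshold`. Raising `threshold` trades recall for precision; raising `k` in the real linker gives this filtering step more candidates to choose from.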

You can look up more information for a given id using the kb attribute of this class:
```
print(linker.kb.cui_to_entity[concept_id])
```

#### Example Usage
```python
import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at one of the detected entities.
entity = doc.ents[1]

print("Name: ", entity)
# Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])

# Example output:
# CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
# Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
#             gene encoding the ANDROGEN RECEPTOR.
# TUI(s): T047
# Aliases (abbreviated, total: 50):
#          Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....
#
# CUI: C0541794, Name: Skeletal muscle atrophy
# Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
#             fiber diameter, force production and fatigue resistance in response to ...
# TUI(s): T046
# Aliases: (total: 9):
#          Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....
#
# CUI: C1447749, Name: AR protein, human
# Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
#             This protein plays a role in the modulation of steroid-dependent gene transcription.
# TUI(s): T116, T192
# Aliases (abbreviated, total: 16):
#          AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...
```

### Hearst Patterns (v0.3.0 and up)

This component implements [Automatic Acquisition of Hyponyms from Large Text Corpora](https://www.aclweb.org/anthology/C92-2082.pdf) using the spaCy Matcher component.

Passing `extended=True` to the `HyponymDetector` (for example, `config={"extended": True}` in `nlp.add_pipe`) uses the extended set of Hearst patterns, which includes higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

This component produces a doc-level attribute on the spaCy doc: `doc._.hearst_patterns`, which is a list containing tuples of extracted hyponym pairs. The tuples contain:

- The relation rule used to extract the hyponym (type: `str`)
- The more general concept (type: `spacy.Span`)
- The more specific concept (type: `spacy.Span`)


#### Usage:

```python
import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
# [('such_as', Keystone plant species, fig trees)]
```
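
To make the shape of the extracted tuples concrete, here is a toy, standard-library-only sketch of Hearst-style matching. It assumes the sentence has already been split into noun phrases and connecting phrases, which the real component derives from the spaCy pipeline; `PATTERNS` and `find_hyponyms` are hypothetical names, and the pattern table is a small illustrative subset rather than the component's actual rule set.

```python
from typing import List, Tuple

# Hypothetical pattern table: connective phrase -> rule name.
PATTERNS = {
    "such as": "such_as",
    "including": "including",
    "especially": "especially",
}


def find_hyponyms(chunks: List[str]) -> List[Tuple[str, str, str]]:
    """Return (rule, general concept, specific concept) for each matched connective."""
    pairs = []
    for i in range(1, len(chunks) - 1):
        rule = PATTERNS.get(chunks[i].lower())
        if rule is not None:
            # The phrase before the connective is the more general concept,
            # the phrase after it the more specific one.
            pairs.append((rule, chunks[i - 1], chunks[i + 1]))
    return pairs


# Pre-chunked version of the example sentence above.
chunks = ["Keystone plant species", "such as", "fig trees", "are good for", "the soil"]
print(find_hyponyms(chunks))
# [('such_as', 'Keystone plant species', 'fig trees')]
```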


## Citing

If you use ScispaCy in your research, please cite [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29). Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.
```
@inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}
```

ScispaCy is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Raw data

            {
    "_id": null,
    "home_page": "https://allenai.github.io/scispacy/",
    "name": "scispacy",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6.0",
    "maintainer_email": null,
    "keywords": "bioinformatics nlp spacy SpaCy biomedical",
    "author": "Allen Institute for Artificial Intelligence",
    "author_email": "ai2-info@allenai.org",
    "download_url": "https://files.pythonhosted.org/packages/13/ec/541b68fba77dbcdcfcbea51bfe6c81652279bdb55a3135b8392421921a18/scispacy-0.5.5.tar.gz",
    "platform": null,
    "description": "<p align=\"center\"><img width=\"50%\" src=\"docs/scispacy-logo.png\" /></p>\n\n\nThis repository contains custom pipes and models related to using spaCy for scientific documents.\n\nIn particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's\nrule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data and\nan entity span detection model. Separately, there are also NER models for more specific tasks.\n\n**Just looking to test out the models on your data? Check out our [demo](https://scispacy.apps.allenai.org)** (Note: this demo is running an older version of scispaCy and may produce different results than the latest version).\n\n\n## Installation\nInstalling scispacy requires two steps: installing the library and intalling the models. To install the library, run:\n```bash\npip install scispacy\n```\n\nto install a model (see our full selection of available models below), run a command like the following:\n\n```bash\npip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz\n```\n\nNote: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy.\nTake a look below in the \"Setting up a virtual environment\" section if you need some help with this.\nAdditionally, scispacy uses modern features of Python and as such is only available for **Python 3.6 or greater**.\n\n\n### Installation note: nmslib\nOver the years, installing nmslib has becomes quite difficult. There are a number of GitHub issues on scispaCy and the nmslib repo itself about this. This matrix is an attempt to help users install nmslib in whatever environment they have. 
I don't have access to every type of environment, so if you are able to test something out, please open an issue or pull request!\n\n|               | Windows 11 | Windows Subsystem for Linux | Mac M1  | Mac M2  | Mac M3  | Intel Mac |\n|---------------|------------|----------------------------|---------|---------|---------|-----------|\n| Python 3.8    | \u2705         | \u2705                         | \ud83d\udcbb      | \u2753      | \u2753      | \u2753         |\n| Python 3.9    | \u274c\ud83d\udc0d       | \u2705                         | \ud83d\udcbb      | \u2753      | \u2753      | \u2753         |\n| Python 3.10   | \u274c\ud83d\udc0d       | \u2705                         | \u2753      | \u2753      | \u2753      | \u2705         |\n| Python 3.11   | \u274c\ud83d\udc0d       | \u274c\ud83d\udc0d                       | \u2753      | \u2753      | \u2753      | \u274c         |\n| Python 3.12   | \u274c\ud83d\udc0d       | \u274c\ud83d\udc0d\ud83e\udde0                     | \u2753      | \u2753      | \u2753      | \u2753         |\n\n\n\u2705 = works normally with pip install of scispacy\n\n\u274c = does not work normally with pip install of scispacy\n\n\ud83d\udc0d = can be installed with `mamba install nmslib`\n\n\ud83d\udcbb = can be installed with `CFLAGS=\"-mavx -DWARN(a)=(a)\" pip install nmslib`\n\n\ud83e\udde0 = can be installed with `pip install nmslib-metabrainz`\n\n\u2753 = unconfirmed\n\nOther methods mentioned in GitHub issues, but unconfirmed what versions they work for:\n- `CFLAGS=\"-mavx -DWARN(a)=(a)\" pip install nmslib`\n- `pip install --no-binary :all: nmslib`\n- `pip install \"nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings\"`\n- `pip install --upgrade pybind11` + `pip install --verbose 'nmslib @ git+https://github.com/nmslib/nmslib.git#egg=nmslib&subdirectory=python_bindings'`\n\n#### Setting up a virtual environment\n\n[Mamba](https://mamba.readthedocs.io/en/latest/) can be used set up a virtual 
environment with the\nversion of Python required for scispaCy.  If you already have a Python\nenvironment you want to use, you can skip to the 'installing via pip' section.\n\n1.  [Follow the installation instructions for Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html).\n\n2.  Create a Conda environment called \"scispacy\" with Python 3.9 (any version >= 3.6 should work):\n\n    ```bash\n    mamba create -n scispacy python=3.10\n    ```\n\n3.  Activate the Mamba environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.\n\n    ```bash\n    mamba activate scispacy\n    ```\n\nNow you can install `scispacy` and one of the models using the steps above.\n\n\nOnce you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:\n```python\nimport spacy\nnlp = spacy.load(\"en_core_sci_sm\")\ndoc = nlp(\"Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.\")\n```\n\n#### Note on upgrading\nIf you are upgrading `scispacy`, you will need to download the models again, to get the model versions compatible with the version of `scispacy` that you have. The link to the model that you download should contain the version number of `scispacy` that you have.\n\n## Available Models\n\nTo install a model, click on the link below to download the model, and then run \n\n```python\npip install </path/to/download>\n```\n\nAlternatively, you can install directly from the URL by right-clicking on the link, selecting \"Copy Link Address\" and running \n```python\npip install CMD-V(to paste the copied URL)\n```\n\n| Model          | Description       | Install URL\n|:---------------|:------------------|:----------|\n| en_core_sci_sm | A full spaCy pipeline for biomedical data with a ~100k vocabulary. 
|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz)|\n| en_core_sci_md |  A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz)|\n| en_core_sci_lg |  A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz)|\n| en_core_sci_scibert |  A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to [use a GPU](https://spacy.io/usage#gpu) with this model. |[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_scibert-0.5.4.tar.gz)|\n| en_ner_craft_md|  A spaCy NER model trained on the CRAFT corpus.|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_craft_md-0.5.4.tar.gz)|\n| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus.| [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_jnlpba_md-0.5.4.tar.gz)|\n| en_ner_bc5cdr_md |  A spaCy NER model trained on the BC5CDR corpus. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz)|\n| en_ner_bionlp13cg_md |  A spaCy NER model trained on the BIONLP13CG corpus. 
|[Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz)|\n\n\n## Additional Pipeline Components\n\n\n### AbbreviationDetector\nThe AbbreviationDetector is a Spacy component which implements the abbreviation detection algorithm in \"A simple algorithm\n    for identifying abbreviation definitions in biomedical text.\", (Schwartz & Hearst, 2003).\n\nYou can access the list of abbreviations via the `doc._.abbreviations` attribute and for a given abbreviation,\nyou can access it's long form (which is a `spacy.tokens.Span`) using `span._.long_form`, which will point to\nanother span in the document.\n\n\n#### Example Usage\n```python\nimport spacy\n\nfrom scispacy.abbreviation import AbbreviationDetector\n\nnlp = spacy.load(\"en_core_sci_sm\")\n\n# Add the abbreviation pipe to the spacy pipeline.\nnlp.add_pipe(\"abbreviation_detector\")\n\ndoc = nlp(\"Spinal and bulbar muscular atrophy (SBMA) is an \\\n           inherited motor neuron disease caused by the expansion \\\n           of a polyglutamine tract within the androgen receptor (AR). \\\n           SBMA can be caused by this easily.\")\n\nprint(\"Abbreviation\", \"\\t\", \"Definition\")\nfor abrv in doc._.abbreviations:\n\tprint(f\"{abrv} \\t ({abrv.start}, {abrv.end}) {abrv._.long_form}\")\n\n>>> Abbreviation\t Span\t    Definition\n>>> SBMA \t\t (33, 34)   Spinal and bulbar muscular atrophy\n>>> SBMA \t   \t (6, 7)     Spinal and bulbar muscular atrophy\n>>> AR   \t\t (29, 30)   androgen receptor\n```\n\n> **Note**\n> If you want to be able to [serialize your `doc` objects](https://spacy.io/usage/saving-loading), load the abbreviation detector with `make_serializable=True`, e.g. `nlp.add_pipe(\"abbreviation_detector\", config={\"make_serializable\": True})`\n\n### EntityLinker\n\nThe `EntityLinker` is a SpaCy component which performs linking to a knowledge base. 
The linker simply performs\na string overlap - based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base\nusing an approximate nearest neighbours search.\n\nCurrently (v2.5.0), there are 5 supported linkers:\n\n- `umls`: Links to the [Unified Medical Language System](https://www.nlm.nih.gov/research/umls/index.html), levels 0,1,2 and 9. This has ~3M concepts.\n- `mesh`: Links to the [Medical Subject Headings](https://www.nlm.nih.gov/mesh/meshhome.html). This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.\n- `rxnorm`: Links to the [RxNorm](https://www.nlm.nih.gov/research/umls/rxnorm/index.html) ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.\n- `go`: Links to the [Gene Ontology](http://geneontology.org/). The Gene Ontology contains ~67k concepts focused on the functions of genes.\n- `hpo`: Links to the [Human Phenotype Ontology](https://hpo.jax.org/app/). 
The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.\n\nYou may want to play around with some of the parameters\nbelow to adapt to your use case (higher precision, higher recall etc).\n\n- `resolve_abbreviations : bool = True, optional (default = False)`\n    Whether to resolve abbreviations identified in the Doc before performing linking.\n    This parameter has no effect if there is no `AbbreviationDetector` in the spacy\n    pipeline.\n- `k : int, optional, (default = 30)`\n    The number of nearest neighbours to look up from the candidate generator per mention.\n- `threshold : float, optional, (default = 0.7)`\n    The threshold that a mention candidate must reach to be added to the mention in the Doc\n    as a mention candidate.\n-   `no_definition_threshold : float, optional, (default = 0.95)`\n        The threshold that a entity candidate must reach to be added to the mention in the Doc\n        as a mention candidate if the entity candidate does not have a definition.\n- `filter_for_definitions: bool, default = True`\n    Whether to filter entities that can be returned to only include those with definitions\n    in the knowledge base.\n- `max_entities_per_mention : int, optional, default = 5`\n    The maximum number of entities which will be returned for a given mention, regardless of\n    how many are nearest neighbours are found.\n\nThis class sets the `._.kb_ents` attribute on spacy Spans, which consists of a\nList[Tuple[str, float]] corresponding to the KB concept_id and the associated score\nfor a list of `max_entities_per_mention` number of entities.\n\nYou can look up more information for a given id using the kb attribute of this class:\n```\nprint(linker.kb.cui_to_entity[concept_id])\n```\n\n#### Example Usage\n```python\nimport spacy\nimport scispacy\n\nfrom scispacy.linking import EntityLinker\n\nnlp = spacy.load(\"en_core_sci_sm\")\n\n# This line takes a while, because we have to 
download ~1GB of data\n# and load a large JSON file (the knowledge base). Be patient!\n# Thankfully it should be faster after the first time you use it, because\n# the downloads are cached.\n# NOTE: The resolve_abbreviations parameter is optional, and requires that\n# the AbbreviationDetector pipe has already been added to the pipeline. Adding\n# the AbbreviationDetector pipe and setting resolve_abbreviations to True means\n# that linking will only be performed on the long form of abbreviations.\nnlp.add_pipe(\"scispacy_linker\", config={\"resolve_abbreviations\": True, \"linker_name\": \"umls\"})\n\ndoc = nlp(\"Spinal and bulbar muscular atrophy (SBMA) is an \\\n           inherited motor neuron disease caused by the expansion \\\n           of a polyglutamine tract within the androgen receptor (AR). \\\n           SBMA can be caused by this easily.\")\n\n# Let's look at a random entity!\nentity = doc.ents[1]\n\nprint(\"Name: \", entity)\n>>> Name: bulbar muscular atrophy\n\n# Each entity is linked to UMLS with a score\n# (currently just char-3gram matching).\nlinker = nlp.get_pipe(\"scispacy_linker\")\nfor umls_ent in entity._.kb_ents:\n\tprint(linker.kb.cui_to_entity[umls_ent[0]])\n\n\n>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked\n>>> Definition: An X-linked recessive form of spinal muscular atrophy. 
It is due to a mutation of the\n  \t\t\t\tgene encoding the ANDROGEN RECEPTOR.\n>>> TUI(s): T047\n>>> Aliases (abbreviated, total: 50):\n         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....\n\n>>> CUI: C0541794, Name: Skeletal muscle atrophy\n>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,\n                fiber diameter, force production and fatigue resistance in response to ...\n>>> TUI(s): T046\n>>> Aliases: (total: 9):\n         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....\n\n>>> CUI: C1447749, Name: AR protein, human\n>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.\n                This protein plays a role in the modulation of steroid-dependent gene transcription.\n>>> TUI(s): T116, T192\n>>> Aliases (abbreviated, total: 16):\n         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...\n```\n\n### Hearst Patterns (v0.3.0 and up)\n\nThis component implements [Automatic Acquisition of Hyponyms from Large Text Corpora](https://www.aclweb.org/anthology/C92-2082.pdf) using the spaCy Matcher component.\n\nPassing `extended=True` to the `HyponymDetector` will use the extended set of Hearst patterns, which includes higher-recall but lower-precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).\n\nThis component produces a doc-level attribute on the spaCy Doc: `doc._.hearst_patterns`, which is a list containing tuples of extracted hyponym pairs. 
The tuples contain:\n\n- The relation rule used to extract the hyponym (type: `str`)\n- The more general concept  (type: `spacy.Span`)\n- The more specific concept (type: `spacy.Span`)\n\n\n#### Usage:\n\n```python\nimport spacy\nfrom scispacy.hyponym_detector import HyponymDetector\n\nnlp = spacy.load(\"en_core_sci_sm\")\nnlp.add_pipe(\"hyponym_detector\", last=True, config={\"extended\": False})\n\ndoc = nlp(\"Keystone plant species such as fig trees are good for the soil.\")\n\nprint(doc._.hearst_patterns)\n>>> [('such_as', Keystone plant species, fig trees)]\n```\n\n\n## Citing\n\nIf you use ScispaCy in your research, please cite [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29). Additionally, please indicate which version and model of ScispaCy you used so that your research can be reproduced.\n```\n@inproceedings{neumann-etal-2019-scispacy,\n    title = \"{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing\",\n    author = \"Neumann, Mark  and\n      King, Daniel  and\n      Beltagy, Iz  and\n      Ammar, Waleed\",\n    booktitle = \"Proceedings of the 18th BioNLP Workshop and Shared Task\",\n    month = aug,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/W19-5034\",\n    doi = \"10.18653/v1/W19-5034\",\n    pages = \"319--327\",\n    eprint = {arXiv:1902.07669},\n    abstract = \"Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. 
This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.\",\n}\n```\n\nScispaCy is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).\nAI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.\n\n\n\n",
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "A full SpaCy pipeline and models for scientific/biomedical documents.",
    "version": "0.5.5",
    "project_urls": {
        "Homepage": "https://allenai.github.io/scispacy/"
    },
    "split_keywords": [
        "bioinformatics",
        "nlp",
        "spacy",
        "spacy",
        "biomedical"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4b4e790d070d25dad45555262fca6a728929b7b88642831197df65480e49b828",
                "md5": "aa68ec520e99cddc751eff66c0b31f34",
                "sha256": "81b6b3d899a4caf224ebe3016eae25a112b259c1941cb62703decea7897a0ab8"
            },
            "downloads": -1,
            "filename": "scispacy-0.5.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "aa68ec520e99cddc751eff66c0b31f34",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6.0",
            "size": 46231,
            "upload_time": "2024-10-27T05:43:35",
            "upload_time_iso_8601": "2024-10-27T05:43:35.771507Z",
            "url": "https://files.pythonhosted.org/packages/4b/4e/790d070d25dad45555262fca6a728929b7b88642831197df65480e49b828/scispacy-0.5.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "13ec541b68fba77dbcdcfcbea51bfe6c81652279bdb55a3135b8392421921a18",
                "md5": "a9c6c93a85a34f77ef6d31fdba49177b",
                "sha256": "a56607572d31b7ab0b1e5d4b7836ca1f319bb402203eef52c50dfe9ed35de60e"
            },
            "downloads": -1,
            "filename": "scispacy-0.5.5.tar.gz",
            "has_sig": false,
            "md5_digest": "a9c6c93a85a34f77ef6d31fdba49177b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6.0",
            "size": 48531,
            "upload_time": "2024-10-27T05:43:37",
            "upload_time_iso_8601": "2024-10-27T05:43:37.860368Z",
            "url": "https://files.pythonhosted.org/packages/13/ec/541b68fba77dbcdcfcbea51bfe6c81652279bdb55a3135b8392421921a18/scispacy-0.5.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-27 05:43:37",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "scispacy"
}

Allen Institute for Artificial Intelligence