taxonerd

Name	taxonerd
Version	1.5.4
home_page	https://github.com/nleguillarme/taxonerd
Summary	A Python package and CLI tool based on spaCy for detecting mentions of taxonomic entities in text
upload_time	2024-06-12 08:11:47
license	MIT License
keywords	spacy ner transformers deep neural networks ecology evolution

![](https://i.ibb.co/G09fX98/taxonerd-logo.png)

Looking for taxon mentions in text? Ask TaxoNERD

* [Features](#features)
* [Models](#models)
* [Installation](#installation)
* [Usage](#usage)
* [Extensions](#extensions)

I would be happy to hear about your use of TaxoNERD: what is your use case? How did TaxoNERD help you? What could make TaxoNERD even more helpful? Please feel free to drop me an email (nicolas[dot]leguillarme[at]univ-grenoble-alpes[dot]fr) or to open an issue.

## Cite TaxoNERD

Le Guillarme, N., & Thuiller, W. (2022). [TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature](https://doi.org/10.1111/2041-210X.13778). Methods in Ecology and Evolution, 13(3), 625-641.

## Features

TaxoNERD is a domain-specific tool for recognizing taxon mentions in the biodiversity literature.

:tada: **It is now possible to use custom taxonomies for entity linking! Check our [example Notebook](https://github.com/nleguillarme/taxonerd/blob/61ff9792628214a5524b0f2cc5c205d4ca82bfc9/extensions/custom_taxonomy.ipynb)**

* TaxoNERD is available as a command-line tool, a Python module, a spaCy pipeline, **and an R package thanks to reticulate**.
* TaxoNERD provides two architectures: the **md** architecture uses spaCy's standard Tok2Vec layer with word vectors for speed, while the **biobert** architecture uses a Transformer-based pretrained language model (dmis-lab/biobert-v1.1) for accuracy.
* TaxoNERD finds scientific names, common names, abbreviated species names and user-defined abbreviations.
* TaxoNERD can link taxon mentions to entities in a reference taxonomy (NCBI Taxonomy, GBIF Backbone and TAXREF at the moment, more to come).
* TaxoNERD is fast (once the model is loaded), and can run on CPU or GPU.
* Entity linking does not need an internet connection, but may require a lot of RAM depending on the size of the taxonomy (e.g. GBIF Backbone -> ~12.5 GB).
* Thanks to [textract](https://textract.readthedocs.io/en/stable/), **TaxoNERD can extract taxon mentions from (almost) any document** (including txt, pdf, csv, xls, jpg, png, and many other formats). With TaxoNERD, detecting taxonomic entities in a JPG file is as simple as this:

<img width="50%" align="left" src="https://github.com/nleguillarme/taxonerd/raw/main/tests/test_data/test_jpg/test.jpg">


``` console
$ taxonerd ask -m en_ner_eco_biobert_weak -f ./tests/test_data/test_jpg/test.jpg
T0	LIVB 158 165	species
T1	LIVB 180 192	Harbour seal
T2	LIVB 194 208	Phoca vitulina
T3	LIVB 361 375	Pacific salmon
T4	LIVB 377 394	Oncorhynchus spp.
T5	LIVB 455 467	harbour seal
T6	LIVB 663 670	species
T7	LIVB 793 805	harbour seal
T8	LIVB 1114 1121	species
T9	LIVB 1127 1133	fishes
T10	LIVB 1137 1148	cephalopods
```


## Models

| Model | Description | Install URL |
|-------|-------------|------:|
| en_ner_eco_md | A spaCy NER model with 50k word vectors (taken from [en_core_sci_md](https://allenai.github.io/scispacy/)), fine-tuned on an ecological corpus. | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md-1.1.0.tar.gz) |
| en_ner_eco_biobert | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on an ecological corpus. | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert-1.1.0.tar.gz) |
| en_ner_eco_md_weak | A spaCy NER model with 50k word vectors (taken from [en_core_sci_md](https://allenai.github.io/scispacy/)), fine-tuned on a silver standard corpus (for improved performance on vernacular names). | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md_weak-1.1.0.tar.gz) |
| en_ner_eco_biobert_weak | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on a silver standard corpus (for improved performance on vernacular names). | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert_weak-1.1.0.tar.gz) |

### Which model should I choose?

If you have access to a GPU, we recommend using one of the biobert models as they tend to be more accurate than the md models.

The en_ner_eco_md_weak and en_ner_eco_biobert_weak models have been fine-tuned on a silver standard corpus generated using weak supervision. They have therefore been trained on a much larger amount of (noisy) data than their gold standard counterparts. As a result, they tend to have better recall, especially for common names. They also appear to have high precision, but their performance has not yet been rigorously evaluated.

If you do not trust weakly-supervised data and you are not particularly interested in detecting common names, en_ner_eco_md and en_ner_eco_biobert are for you. These models have been fine-tuned on a gold standard corpus (a combination of [COPIOUS](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/), [Bacteria Biotope 2019](https://aclanthology.org/D19-5719/), and [BiodivNERE](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9836593/)) and their performance has been benchmarked in our paper.
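
The guidance above can be summarised as a small decision helper. This is only a sketch: `choose_model` is a hypothetical function that is not part of TaxoNERD, and the returned names are taken from the model download file names listed in this README.

``` python
def choose_model(has_gpu: bool, want_common_names: bool) -> str:
    """Pick a TaxoNERD model name following the guidance in this README."""
    # The biobert models tend to be more accurate and are recommended with a GPU.
    architecture = "biobert" if has_gpu else "md"
    # The weakly supervised models tend to have better recall on common names.
    suffix = "_weak" if want_common_names else ""
    return f"en_ner_eco_{architecture}{suffix}"


print(choose_model(has_gpu=True, want_common_names=False))
```

The returned string can then be passed to the `-m` option of the command-line tool or to the `model` argument of `taxonerd.load`.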

## Installation

### TaxoNERD for Python

Installing the package from pip will automatically install all dependencies, including pandas, spaCy, scispaCy and textract. Make sure you install this package before you install the models. Also note that this package requires Python 3.10 and spaCy v3.7.

    $ pip install taxonerd

For GPU support, find your CUDA version using `nvcc --version` and add the version in brackets, e.g. `pip install taxonerd[cuda12x]` for CUDA 12.1. See [setup.cfg](setup.cfg) for supported CUDA versions.

To download the models:

    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md_weak-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert_weak-1.1.0.tar.gz
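
To check which models are already installed, something like the following sketch may help. It assumes that each model archive installs a Python package named after the model (as spaCy model packages usually do); `installed_models` is a hypothetical helper, not part of TaxoNERD.

``` python
import importlib.util

# Model names as they appear in the download file names above.
MODELS = [
    "en_ner_eco_md",
    "en_ner_eco_biobert",
    "en_ner_eco_md_weak",
    "en_ner_eco_biobert_weak",
]


def installed_models(names=MODELS):
    """Map each name to True if a top-level package of that name can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in installed_models().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```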

Entity linker files are downloaded and cached the first time a linker is used. This may take some time, but it only happens once per linker. Currently (v1.5.4), three linkers are supported:

* gbif_backbone: Links to [GBIF Backbone Taxonomy (2023-08-28)](https://www.gbif.org/fr/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c) (~9.5M names for ~3.5M taxa).
* taxref: Links to [TAXREF (v17)](https://inpn.mnhn.fr/telechargement/referentielEspece/taxref/17.0/menu) (~1.2M names for ~267k taxa).
* ncbi_taxonomy: Links to [The NCBI Taxonomy (2024-05-22)](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) (~3.4M names).
<!-- * ncbi_taxonomy_lite: Links to [The NCBI Taxonomy](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) from which we removed virus names and added abreviated species name (e.g. *P. marina*) (~3.5M names). The ncbi_taxonomy_lite linker supports abbreviated species names out-of-the-box. This means that even if you do not use the abbreviation detector, abbreviated species names such as *P. marina* can be linked to the corresponding taxonomic unit *Pirellula marina* (NCBI:214). -->

### TaxoNERD for R

    > install.packages("https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/taxonerd_for_R_1.5.4.tar.gz", repos=NULL)
    > vignette("taxonerd") # See vignette for more information on how to install and use TaxoNERD for R

## Usage

TaxoNERD can be used as:
* [a command-line tool](#use-as-command-line-tool)
* [a Python module](#use-as-python-module)
* [a spaCy pipeline](#use-as-spacy-pipeline)

### Use as command-line tool

``` console
$ taxonerd ask --help
Usage: taxonerd ask [OPTIONS] [INPUT_TEXT]

Options:
  -m, --model TEXT       A TaxoNERD model [default = en_ner_eco_md]
  -i, --input-dir TEXT   Input directory
  -o, --output-dir TEXT  Output directory
  -f, --filename TEXT    Input text file
  -a, --with-abbrev      Add abbreviation detector to the pipeline
  -s, --with-sentence    Add sentence segmenter to the pipeline
  -l, --link-to TEXT     Add entity linker to the pipeline
  -t, --thresh FLOAT     Similarity threshold for entity linking [default =
                         0.7]

  --prefer-gpu           Use GPU if available
  -v, --verbose          Verbose mode
  --help                 Show this message and exit.
```

  #### Examples

  ##### Taxonomic NER from the terminal

``` console
$ taxonerd ask -m en_ner_eco_biobert "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0	LIVB 0 11	Brown bears
T1	LIVB 13 25	Ursus arctos
```

  ##### Taxonomic NER with entity linking from the terminal

``` console
$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0	LIVB 0 11	Brown bears	[('GBIF:2433433', 'Brown Bear', 0.8313919901847839)]
T1	LIVB 13 25	Ursus arctos	[('GBIF:2433433', 'Ursus arctos', 1.0)]

$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone -t 0.85 "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0	LIVB 13 25	Ursus arctos	[('GBIF:2433433', 'Ursus arctos', 1.0)]
```
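
The effect of the threshold in the two runs above can be reproduced with a few lines of plain Python. This is a sketch of the observed behaviour, not TaxoNERD's actual implementation: `filter_by_threshold` is a hypothetical helper, and treating the threshold as inclusive (`>=`) is an assumption.

``` python
def filter_by_threshold(mentions, thresh=0.7):
    """Keep candidates scoring at least `thresh`; drop mentions left with none."""
    kept = []
    for text, candidates in mentions:
        # Each candidate is an (id, name, similarity) tuple, as in the output above.
        good = [cand for cand in candidates if cand[2] >= thresh]
        if good:
            kept.append((text, good))
    return kept


mentions = [
    ("Brown bears", [("GBIF:2433433", "Brown Bear", 0.8313919901847839)]),
    ("Ursus arctos", [("GBIF:2433433", "Ursus arctos", 1.0)]),
]
print([text for text, _ in filter_by_threshold(mentions, thresh=0.85)])
```

With the default threshold of 0.7 both mentions are kept; with 0.85 only `Ursus arctos` remains, which matches the second run.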

  ##### Taxonomic NER from a text file (with abbreviation detection)

``` console
$ taxonerd ask -m en_ner_eco_biobert --with-abbrev -f ./tests/test_data/test_txt/test1.txt
T0	LIVB 4 21	pinewood nematode
T1	LIVB 23 26	PWN
T2	LIVB 29 55	Bursaphelenchus xylophilus
T3	LIVB 57 70	B. xylophilus
T4	LIVB 99 108	pine wilt
T5	LIVB 196 204	Serratia
T6	LIVB 257 260	PWN
T7	LIVB 342 364	Serratia grimesii BXF1
T8	LIVB 387 390	PWN
T9	LIVB 440 444	BXF1
```

  ##### Taxonomic NER from a directory containing text files, with results written in the output directory

``` console
$ taxonerd ask -m en_ner_eco_biobert -i ./tests/test_data/test_txt -o test_ann
$ ls test_ann/
test1.ann  test2.ann
$ cat test_ann/test2.ann
T0	LIVB 700 711	Brown bears
T1	LIVB 713 725	Ursus arctos
T2	LIVB 1062 1073	brown bears
T3	LIVB 1161 1172	brown bears
T4	LIVB 1339 1350	brown bears
T5	LIVB 1555 1565	brown bear
T6	LIVB 1782 1793	brown bears
T7	LIVB 1863 1874	brown bears
T8	LIVB 1958 1969	brown bears
T9	LIVB 1974 1980	salmon
T10	LIVB 2026 2037	brown bears
T11	LIVB 2219 2230	brown bears
T12	LIVB 2392 2401	Sika deer
T13	LIVB 2403 2416	Cervus nippon
T14	LIVB 2555 2559	deer
T15	LIVB 2594 2604	brown bear
T16	LIVB 2798 2808	brown bear
T17	LIVB 3146 3150	deer
T18	LIVB 3188 3199	chum salmon
T19	LIVB 3201 3218	Oncorhynchus keta
T20	LIVB 3280 3289	Sika deer
T21	LIVB 3350 3361	pink salmon
T22	LIVB 3363 3375	O. gorbuscha
T23	LIVB 3381 3392	chum salmon
T24	LIVB 3518 3528	Brown bear
T25	LIVB 4001 4012	brown bears
T26	LIVB 4071 4082	brown bears
```
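
The `.ann` lines shown above are tab-separated: an identifier, a label with start and end character offsets, and the mention text. A minimal parser could look like the sketch below; `Mention` and `parse_ann_line` are hypothetical names, and the layout is inferred from the sample output only.

``` python
from typing import NamedTuple


class Mention(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann_line(line: str) -> Mention:
    """Parse one line such as 'T1<TAB>LIVB 713 725<TAB>Ursus arctos'."""
    # Only the first three tab-separated fields are used; any extra field
    # (for example linked entities) is ignored in this sketch.
    ann_id, span, text = line.rstrip("\n").split("\t")[:3]
    label, start, end = span.split(" ")
    return Mention(ann_id, label, int(start), int(end), text)


sample = "T0\tLIVB 700 711\tBrown bears\nT1\tLIVB 713 725\tUrsus arctos\n"
parsed = [parse_ann_line(line) for line in sample.splitlines() if line.strip()]
print(parsed[1].text, parsed[1].start, parsed[1].end)
```

In the examples of this README the end offset is exclusive, so `text[start:end]` on the original input recovers the mention (for instance, characters 13 to 25 of the brown bear sentence are `Ursus arctos`).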

### Use as Python module

``` python
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=False)
>>> nlp = taxonerd.load(model="en_ner_eco_md", exclude=[], linker="taxref", threshold=0.7)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'pysbd_sentencizer', 'parser', 'ner', 'taxo_abbrev_detector', 'taxon_linker']
```

**N.B.** By default, all components are included in the pipeline. Use the ``exclude`` argument to specify the components to exclude. Excluded components won’t be loaded. This may speed up the detection process. The minimal pipeline for taxonomic NER is ``['tok2vec', 'ner']``.
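
One way to build the `exclude` list is to start from the components you want to keep. The sketch below uses the component names printed by `nlp.pipe_names` in the example above (for the `en_ner_eco_md` model with a linker); names may differ for other models, and `components_to_exclude` is a hypothetical helper.

``` python
# Component names copied from the nlp.pipe_names output shown above.
ALL_COMPONENTS = [
    "tok2vec", "tagger", "attribute_ruler", "lemmatizer", "pysbd_sentencizer",
    "parser", "ner", "taxo_abbrev_detector", "taxon_linker",
]


def components_to_exclude(keep):
    """Return every known component that is not in `keep`, in pipeline order."""
    keep = set(keep)
    return [name for name in ALL_COMPONENTS if name not in keep]


exclude = components_to_exclude(["tok2vec", "ner"])
print(exclude)
# The list could then be passed on, e.g.:
# nlp = taxonerd.load(model="en_ner_eco_md", exclude=exclude)
```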

#### Examples

  ##### Find taxonomic entities in an input string

``` python
>>> taxonerd.find_in_text("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
      offsets           text                               entity  sent
T0  LIVB 13 25  Ursus arctos  [(TAXREF:60826, Ursus arctos, 1.0)]     0
```

  ##### Find taxonomic entities in an input file

``` python
>>> taxonerd.find_in_file("./tests/test_data/test_txt/test2.txt", output_dir=None)
            offsets               text                                             entity  sent
T0     LIVB 713 725       Ursus arctos                [(TAXREF:60826, Ursus arctos, 1.0)]     4
T1   LIVB 1974 1980             salmon      [(TAXREF:730671, Salmonia, 0.85158771276474)]    12
T2   LIVB 2392 2401          Sika deer                   [(TAXREF:61025, Sika Deer, 1.0)]    14
T3   LIVB 2403 2416      Cervus nippon               [(TAXREF:61025, Cervus nippon, 1.0)]    14
T4   LIVB 3135 3141             salmon      [(TAXREF:730671, Salmonia, 0.85158771276474)]    18
T5   LIVB 3146 3150               deer                       [(TAXREF:186210, deer, 1.0)]    18
T6   LIVB 3188 3199        chum salmon    [(TAXREF:730671, Salmonia, 0.7018352746963501)]    19
T7   LIVB 3201 3218  Oncorhynchus keta  [(TAXREF:195439, Oncorhynchus, 0.8319146037101...    19
T8   LIVB 3280 3289          Sika deer                   [(TAXREF:61025, Sika Deer, 1.0)]    19
T9   LIVB 3350 3361        pink salmon                 [(TAXREF:67798, Pink Salmon, 1.0)]    20
T10  LIVB 3381 3392        chum salmon    [(TAXREF:730671, Salmonia, 0.7018352746963501)]    20
T11  LIVB 3481 3485               deer                       [(TAXREF:186210, deer, 1.0)]    20
```
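
Several mentions can link to the same taxon (above, `Sika deer` and `Cervus nippon` both resolve to `TAXREF:61025`). The sketch below groups mention texts by the ID of their first candidate. It works on plain dictionaries with the `text` and `entity` columns shown above; it assumes that the result is a pandas DataFrame (so that `df.to_dict("records")` would yield such dictionaries) and that the first candidate is the best match. `mentions_by_taxon` is a hypothetical helper.

``` python
from collections import defaultdict


def mentions_by_taxon(records):
    """Group mention texts by the ID of their first linked candidate."""
    grouped = defaultdict(list)
    for row in records:
        if row["entity"]:  # skip mentions without any linked candidate
            taxon_id = row["entity"][0][0]
            grouped[taxon_id].append(row["text"])
    return dict(grouped)


# A few rows copied from the output above.
records = [
    {"text": "Ursus arctos", "entity": [("TAXREF:60826", "Ursus arctos", 1.0)]},
    {"text": "Sika deer", "entity": [("TAXREF:61025", "Sika Deer", 1.0)]},
    {"text": "Cervus nippon", "entity": [("TAXREF:61025", "Cervus nippon", 1.0)]},
]
print(mentions_by_taxon(records))
```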

  ##### Find taxonomic entities in all the files in the input directory, and write the results in the output directory

``` python
>>> taxonerd.find_in_corpus("./tests/test_data/test_txt", "./test_ann")
{'test1.txt': './test_ann/test1.ann', 'test2.txt': './test_ann/test2.ann'}
```

### Use as spaCy pipeline

``` python
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=True)
>>> nlp = taxonerd.load(model="en_ner_eco_biobert")
>>> doc = nlp("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
>>> doc.ents
(Brown bears, Ursus arctos)
>>> [tok.lemma_ for tok in doc]
['Brown', 'bear', '(', 'ursus', 'arcto', ')', ',', 'which', 'be', 'widely', 'distribute', 'throughout', 'the', 'northern', 'hemisphere', ',', 'be', 'recognise', 'as', 'opportunistic', 'omnivore']
```

More examples are available in our [demo Notebook](https://github.com/nleguillarme/taxonerd/blob/9f5b1e264ba129eeeda383aa8085605c8fa9b379/taxonerd-demo.ipynb).

## Extensions

* [Combining TaxoNERD with gazetteer-based NER for improved taxonomic entities recognition](https://github.com/nleguillarme/taxonerd/blob/a58808e5808d74e341d0d98bc64dfebd7a670b81/extensions/entity_ruler.ipynb)

## License

License: MIT

## Authors

TaxoNERD was written by [nleguillarme](https://github.com/nleguillarme/).
