![](https://i.ibb.co/G09fX98/taxonerd-logo.png)
Looking for taxon mentions in text? Ask TaxoNERD
* [Features](#features)
* [Models](#models)
* [Installation](#installation)
* [Usage](#usage)
* [Extensions](#extensions)
I would be happy to hear about your use of TaxoNERD: what is your use case? How did TaxoNERD help you? What could make TaxoNERD even more helpful? Please feel free to drop me an email (nicolas[dot]leguillarme[at]univ-grenoble-alpes[dot]fr) or to open an issue.
## Cite TaxoNERD
Le Guillarme, N., & Thuiller, W. (2022). [TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature](https://doi.org/10.1111/2041-210X.13778). Methods in Ecology and Evolution, 13(3), 625-641.
## Features
TaxoNERD is a domain-specific tool for recognizing taxon mentions in the biodiversity literature.
:tada: **It is now possible to use custom taxonomies for entity linking! Check our [example Notebook](https://github.com/nleguillarme/taxonerd/blob/61ff9792628214a5524b0f2cc5c205d4ca82bfc9/extensions/custom_taxonomy.ipynb)**
* TaxoNERD is available as a command-line tool, a Python module, a spaCy pipeline, **and an R package thanks to reticulate**.
* TaxoNERD provides two architectures: the **md** architecture uses spaCy's standard Tok2Vec layer with word vectors for speed, while the **biobert** architecture uses a Transformer-based pretrained language model (dmis-lab/biobert-v1.1) for accuracy.
* TaxoNERD finds scientific names, common names, abbreviated species names and user-defined abbreviations.
* TaxoNERD can link taxon mentions to entities in a reference taxonomy (NCBI Taxonomy, GBIF Backbone and TAXREF at the moment, more to come).
* TaxoNERD is fast (once the model is loaded), and can run on CPU or GPU.
* Entity linking does not need an internet connection, but may require a lot of RAM depending on the size of the taxonomy (e.g. GBIF Backbone -> ~12.5 GB).
* Thanks to [textract](https://textract.readthedocs.io/en/stable/), **TaxoNERD can extract taxon mentions from (almost) any document** (including txt, pdf, csv, xls, jpg, png, and many other formats). With TaxoNERD, detecting taxonomic entities in a JPG file is as simple as this:
<img width="50%" align="left" src="https://github.com/nleguillarme/taxonerd/raw/main/tests/test_data/test_jpg/test.jpg">
``` console
$ taxonerd ask -m en_ner_eco_biobert_weak -f ./tests/test_data/test_jpg/test.jpg
T0 LIVB 158 165 species
T1 LIVB 180 192 Harbour seal
T2 LIVB 194 208 Phoca vitulina
T3 LIVB 361 375 Pacific salmon
T4 LIVB 377 394 Oncorhynchus spp.
T5 LIVB 455 467 harbour seal
T6 LIVB 663 670 species
T7 LIVB 793 805 harbour seal
T8 LIVB 1114 1121 species
T9 LIVB 1127 1133 fishes
T10 LIVB 1137 1148 cephalopods
```
## Models
| Model | Description | Install URL |
|---------------------|-------------|------:|
| en_ner_eco_md | A spaCy NER model with 50k word vectors (taken from [en_core_sci_md](https://allenai.github.io/scispacy/)), fine-tuned on an ecological corpus. | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md-1.1.0.tar.gz) |
| en_ner_eco_biobert | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on an ecological corpus. | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert-1.1.0.tar.gz) |
| en_ner_eco_md_weak | A spaCy NER model with 50k word vectors (taken from [en_core_sci_md](https://allenai.github.io/scispacy/)), fine-tuned on a silver standard corpus (for improved performance on vernacular names). | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md_weak-1.1.0.tar.gz) |
| en_ner_eco_biobert_weak | A spaCy NER model with dmis-lab/biobert-v1.1 as the transformer model, fine-tuned on a silver standard corpus (for improved performance on vernacular names). | [Download](https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert_weak-1.1.0.tar.gz) |
### Which model should I choose?
If you have access to a GPU, we recommend using one of the biobert models as they tend to be more accurate than the md models.
The en_ner_eco_md_weak and en_ner_eco_biobert_weak models have been fine-tuned on a silver standard corpus generated using weak supervision, so they have been trained on a much larger amount of (noisy) data than their gold standard counterparts. As a result, they tend to have better recall, especially for common names. Their precision also appears to be high, but their performance has not been rigorously evaluated.
If you do not trust weakly-supervised data and you are not particularly interested in detecting common names, en_ner_eco_md and en_ner_eco_biobert are for you. These models have been fine-tuned on a gold standard corpus (a combination of [COPIOUS](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/), [Bacteria Biotope 2019](https://aclanthology.org/D19-5719/), and [BiodivNERE](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9836593/)) and their performance has been benchmarked in our paper.
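The guidance above can be condensed into a small helper. This is only an illustrative sketch: `choose_model` is a hypothetical function that is not part of TaxoNERD, and the model names it returns are taken from the package file names in the download links.

``` python
def choose_model(has_gpu: bool, want_common_names: bool) -> str:
    """Return a TaxoNERD model name following the guidance above.

    Hypothetical helper, not part of the taxonerd package.
    """
    # biobert models tend to be more accurate and are recommended with a GPU;
    # md models favour speed.
    architecture = "biobert" if has_gpu else "md"
    # The weakly supervised models tend to have better recall on common names.
    suffix = "_weak" if want_common_names else ""
    return f"en_ner_eco_{architecture}{suffix}"


print(choose_model(has_gpu=True, want_common_names=True))
print(choose_model(has_gpu=False, want_common_names=False))
```

The returned string can be passed to the `-m` option of the command-line tool or to the `model` argument of `taxonerd.load`.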
## Installation
### TaxoNERD for Python
Installing the package from pip will automatically install all dependencies, including pandas, spaCy, scispaCy and textract. Make sure you install this package before you install the models. Also note that this package requires Python 3.10 and spaCy v3.7.

    $ pip install taxonerd
For GPU support, find your CUDA version using `nvcc --version` and add the version in brackets, e.g. `pip install taxonerd[cuda12x]` for CUDA 12.1. See [setup.cfg](setup.cfg) for supported CUDA versions.
To download the models:

    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_md_weak-1.1.0.tar.gz
    $ pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/en_ner_eco_biobert_weak-1.1.0.tar.gz
Entity linker files are downloaded and cached the first time the linker is used. This may take some time, but it only needs to happen once. Currently (v1.5.4), there are three supported linkers:
* gbif_backbone: Links to [GBIF Backbone Taxonomy (2023-08-28)](https://www.gbif.org/fr/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c) (~9.5M names for ~3.5M taxa).
* taxref: Links to [TAXREF (v17)](https://inpn.mnhn.fr/telechargement/referentielEspece/taxref/17.0/menu) (~1.2M names for ~267k taxa).
* ncbi_taxonomy: Links to [The NCBI Taxonomy (2024-05-22)](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) (~3.4M names).
<!-- * ncbi_taxonomy_lite: Links to [The NCBI Taxonomy](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) from which we removed virus names and added abreviated species name (e.g. *P. marina*) (~3.5M names). The ncbi_taxonomy_lite linker supports abbreviated species names out-of-the-box. This means that even if you do not use the abbreviation detector, abbreviated species names such as *P. marina* can be linked to the corresponding taxonomic unit *Pirellula marina* (NCBI:214). -->
### TaxoNERD for R
    > install.packages("https://github.com/nleguillarme/taxonerd/releases/download/v1.5.4/taxonerd_for_R_1.5.4.tar.gz", repos=NULL)
    > vignette("taxonerd") # See vignette for more information on how to install and use TaxoNERD for R
## Usage
TaxoNERD can be used as:
* [a command-line tool](#use-as-command-line-tool)
* [a Python module](#use-as-python-module)
* [a spaCy pipeline](#use-as-spacy-pipeline)
### Use as command-line tool
``` console
$ taxonerd ask --help
Usage: taxonerd ask [OPTIONS] [INPUT_TEXT]
Options:
-m, --model TEXT A TaxoNERD model [default = en_ner_eco_md]
-i, --input-dir TEXT Input directory
-o, --output-dir TEXT Output directory
-f, --filename TEXT Input text file
-a, --with-abbrev Add abbreviation detector to the pipeline
-s, --with-sentence Add sentence segmenter to the pipeline
-l, --link-to TEXT Add entity linker to the pipeline
-t, --thresh FLOAT Similarity threshold for entity linking [default =
0.7]
--prefer-gpu Use GPU if available
-v, --verbose Verbose mode
--help Show this message and exit.
```
#### Examples
##### Taxonomic NER from the terminal
``` console
$ taxonerd ask -m en_ner_eco_biobert "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 0 11 Brown bears
T1 LIVB 13 25 Ursus arctos
```
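Judging from this example, the two numbers after the `LIVB` label are character offsets into the input text, with an exclusive end offset. The following sketch checks that reading against the output above; `check_offsets` is a hypothetical helper, not a TaxoNERD function.

``` python
text = (
    "Brown bears (Ursus arctos), which are widely distributed throughout "
    "the northern hemisphere, are recognised as opportunistic omnivores"
)

# (start, end, mention) triples copied from the output above.
annotations = [(0, 11, "Brown bears"), (13, 25, "Ursus arctos")]


def check_offsets(text, annotations):
    """Return True if every (start, end) span slices out its mention."""
    return all(text[start:end] == mention for start, end, mention in annotations)


print(check_offsets(text, annotations))
```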
##### Taxonomic NER with entity linking from the terminal
``` console
$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 0 11 Brown bears [('GBIF:2433433', 'Brown Bear', 0.8313919901847839)]
T1 LIVB 13 25 Ursus arctos [('GBIF:2433433', 'Ursus arctos', 1.0)]
$ taxonerd ask -m en_ner_eco_biobert -l gbif_backbone -t 0.85 "Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivores"
T0 LIVB 13 25 Ursus arctos [('GBIF:2433433', 'Ursus arctos', 1.0)]
```
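The effect of `-t` in the two runs above can be reproduced with a few lines of plain Python. This is a sketch of the observed behaviour only, not TaxoNERD's implementation: `filter_by_threshold` is a hypothetical function, and treating the threshold as inclusive (`>=`) is an assumption.

``` python
def filter_by_threshold(mentions, thresh=0.7):
    """Keep candidates scoring at least `thresh`; drop mentions left without any."""
    kept = []
    for text, candidates in mentions:
        # Each candidate is an (identifier, matched name, similarity) tuple.
        good = [c for c in candidates if c[2] >= thresh]
        if good:
            kept.append((text, good))
    return kept


# Mentions and candidates copied from the first run above.
mentions = [
    ("Brown bears", [("GBIF:2433433", "Brown Bear", 0.8313919901847839)]),
    ("Ursus arctos", [("GBIF:2433433", "Ursus arctos", 1.0)]),
]

print([text for text, _ in filter_by_threshold(mentions, thresh=0.7)])
print([text for text, _ in filter_by_threshold(mentions, thresh=0.85)])
```

With the default threshold both mentions are kept; at 0.85 the "Brown bears" candidate (similarity of about 0.83) falls below the threshold and the mention disappears, as in the second run.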
##### Taxonomic NER from a text file (with abbreviation detection)
``` console
$ taxonerd ask -m en_ner_eco_biobert --with-abbrev -f ./tests/test_data/test_txt/test1.txt
T0 LIVB 4 21 pinewood nematode
T1 LIVB 23 26 PWN
T2 LIVB 29 55 Bursaphelenchus xylophilus
T3 LIVB 57 70 B. xylophilus
T4 LIVB 99 108 pine wilt
T5 LIVB 196 204 Serratia
T6 LIVB 257 260 PWN
T7 LIVB 342 364 Serratia grimesii BXF1
T8 LIVB 387 390 PWN
T9 LIVB 440 444 BXF1
```
##### Taxonomic NER from a directory containing text files, with results written in the output directory
``` console
$ taxonerd ask -m en_ner_eco_biobert -i ./tests/test_data/test_txt -o test_ann
$ ls test_ann/
test1.ann test2.ann
$ cat test_ann/test2.ann
T0 LIVB 700 711 Brown bears
T1 LIVB 713 725 Ursus arctos
T2 LIVB 1062 1073 brown bears
T3 LIVB 1161 1172 brown bears
T4 LIVB 1339 1350 brown bears
T5 LIVB 1555 1565 brown bear
T6 LIVB 1782 1793 brown bears
T7 LIVB 1863 1874 brown bears
T8 LIVB 1958 1969 brown bears
T9 LIVB 1974 1980 salmon
T10 LIVB 2026 2037 brown bears
T11 LIVB 2219 2230 brown bears
T12 LIVB 2392 2401 Sika deer
T13 LIVB 2403 2416 Cervus nippon
T14 LIVB 2555 2559 deer
T15 LIVB 2594 2604 brown bear
T16 LIVB 2798 2808 brown bear
T17 LIVB 3146 3150 deer
T18 LIVB 3188 3199 chum salmon
T19 LIVB 3201 3218 Oncorhynchus keta
T20 LIVB 3280 3289 Sika deer
T21 LIVB 3350 3361 pink salmon
T22 LIVB 3363 3375 O. gorbuscha
T23 LIVB 3381 3392 chum salmon
T24 LIVB 3518 3528 Brown bear
T25 LIVB 4001 4012 brown bears
T26 LIVB 4071 4082 brown bears
```
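The `.ann` files can be read back with a short parser. This is a minimal sketch that assumes each line has the layout shown above (identifier, label, start offset, end offset, mention text, separated by tabs or spaces) and that no entity linker was used; `Annotation` and `parse_ann_line` are hypothetical names, not part of TaxoNERD.

``` python
import re
from typing import NamedTuple

# Identifier, label, start, end, then the mention text up to the end of the line.
ANN_LINE = re.compile(r"^(T\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")


class Annotation(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann_line(line: str) -> Annotation:
    """Parse one annotation line such as 'T1 LIVB 713 725 Ursus arctos'."""
    match = ANN_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised annotation line: {line!r}")
    ann_id, label, start, end, text = match.groups()
    return Annotation(ann_id, label, int(start), int(end), text)


sample = "T0\tLIVB 700 711\tBrown bears\nT1\tLIVB 713 725\tUrsus arctos\n"
annotations = [parse_ann_line(line) for line in sample.splitlines()]
print(annotations[1].text, annotations[1].start, annotations[1].end)
```

If an entity linker is enabled, the linked candidates follow the mention text on the same line, so the pattern would need an extra field.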
### Use as Python module
``` python
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=False)
>>> nlp = taxonerd.load(model="en_ner_eco_md", exclude=[], linker="taxref", threshold=0.7)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'pysbd_sentencizer', 'parser', 'ner', 'taxo_abbrev_detector', 'taxon_linker']
```
**N.B.** By default, all components are included in the pipeline. Use the ``exclude`` argument to specify the components to exclude. Excluded components won’t be loaded. This may speed up the detection process. The minimal pipeline for taxonomic NER is ``['tok2vec', 'ner']``.
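One way to build the `exclude` list is to start from the full pipeline and keep only the components you need. The sketch below uses the component names listed by `nlp.pipe_names` above; `components_to_exclude` is a hypothetical helper, not part of TaxoNERD.

``` python
# Component names as reported by nlp.pipe_names for the full pipeline.
FULL_PIPELINE = [
    "tok2vec", "tagger", "attribute_ruler", "lemmatizer", "pysbd_sentencizer",
    "parser", "ner", "taxo_abbrev_detector", "taxon_linker",
]


def components_to_exclude(pipe_names, keep):
    """Return the components in `pipe_names` that are not in `keep`, in order."""
    keep = set(keep)
    return [name for name in pipe_names if name not in keep]


# Keep only the minimal pipeline for taxonomic NER.
exclude = components_to_exclude(FULL_PIPELINE, keep=["tok2vec", "ner"])
print(exclude)
```

The resulting list can then be passed to `taxonerd.load(model="en_ner_eco_md", exclude=exclude)`.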
#### Examples
##### Find taxonomic entities in an input string
``` python
>>> taxonerd.find_in_text("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
offsets text entity sent
T0 LIVB 13 25 Ursus arctos [(TAXREF:60826, Ursus arctos, 1.0)] 0
```
##### Find taxonomic entities in an input file
``` python
>>> taxonerd.find_in_file("./tests/test_data/test_txt/test2.txt", output_dir=None)
offsets text entity sent
T0 LIVB 713 725 Ursus arctos [(TAXREF:60826, Ursus arctos, 1.0)] 4
T1 LIVB 1974 1980 salmon [(TAXREF:730671, Salmonia, 0.85158771276474)] 12
T2 LIVB 2392 2401 Sika deer [(TAXREF:61025, Sika Deer, 1.0)] 14
T3 LIVB 2403 2416 Cervus nippon [(TAXREF:61025, Cervus nippon, 1.0)] 14
T4 LIVB 3135 3141 salmon [(TAXREF:730671, Salmonia, 0.85158771276474)] 18
T5 LIVB 3146 3150 deer [(TAXREF:186210, deer, 1.0)] 18
T6 LIVB 3188 3199 chum salmon [(TAXREF:730671, Salmonia, 0.7018352746963501)] 19
T7 LIVB 3201 3218 Oncorhynchus keta [(TAXREF:195439, Oncorhynchus, 0.8319146037101... 19
T8 LIVB 3280 3289 Sika deer [(TAXREF:61025, Sika Deer, 1.0)] 19
T9 LIVB 3350 3361 pink salmon [(TAXREF:67798, Pink Salmon, 1.0)] 20
T10 LIVB 3381 3392 chum salmon [(TAXREF:730671, Salmonia, 0.7018352746963501)] 20
T11 LIVB 3481 3485 deer [(TAXREF:186210, deer, 1.0)] 20
```
##### Find taxonomic entities in all the files in the input directory, and write the results in the output directory
``` python
>>> taxonerd.find_in_corpus("./tests/test_data/test_txt", "./test_ann")
{'test1.txt': './test_ann/test1.ann', 'test2.txt': './test_ann/test2.ann'}
```
### Use as spaCy pipeline
``` python
>>> from taxonerd import TaxoNERD
>>> taxonerd = TaxoNERD(prefer_gpu=True)
>>> nlp = taxonerd.load(model="en_ner_eco_biobert")
>>> doc = nlp("Brown bears (Ursus arctos), which are widely distributed throughout the northern hemisphere, are recognised as opportunistic omnivore")
>>> doc.ents
(Brown bears, Ursus arctos)
>>> [tok.lemma_ for tok in doc]
['Brown', 'bear', '(', 'ursus', 'arcto', ')', ',', 'which', 'be', 'widely', 'distribute', 'throughout', 'the', 'northern', 'hemisphere', ',', 'be', 'recognise', 'as', 'opportunistic', 'omnivore']
```
More examples in our [demo Notebook](https://github.com/nleguillarme/taxonerd/blob/9f5b1e264ba129eeeda383aa8085605c8fa9b379/taxonerd-demo.ipynb).
## Extensions
* [Combining TaxoNERD with gazetteer-based NER for improved taxonomic entity recognition](https://github.com/nleguillarme/taxonerd/blob/a58808e5808d74e341d0d98bc64dfebd7a670b81/extensions/entity_ruler.ipynb)
## License
License: MIT
## Authors
TaxoNERD was written by [nleguillarme](https://github.com/nleguillarme/).