<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification
This package provides [spaCy](https://github.com/explosion/spaCy) components to
use pretrained
[Hugging Face Transformers pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)
for inference only.
[![PyPi](https://img.shields.io/pypi/v/spacy-huggingface-pipelines.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/spacy-huggingface-pipelines)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-huggingface-pipelines/all.svg?style=flat-square&logo=github)](https://github.com/explosion/spacy-huggingface-pipelines/releases)
## Features
- Apply pretrained transformers models like
[`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) and
[`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
## 🚀 Installation
Installing the package with pip automatically installs all dependencies,
including PyTorch and spaCy.
```bash
pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines
```
For GPU installation, follow the
[spaCy installation quickstart with GPU](https://spacy.io/usage/), e.g.
```bash
pip install -U 'spacy[cuda-autodetect]'
```
If you are having trouble installing PyTorch, follow the
[instructions](https://pytorch.org/get-started/locally/) on the official website
for your specific operating system and requirements.
## 📖 Documentation
This module provides spaCy wrappers for the inference-only transformers
[`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TokenClassificationPipeline)
and
[`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextClassificationPipeline)
pipelines.
The models are downloaded on initialization from the
[Hugging Face Hub](https://huggingface.co/models) if they're not already in your
local cache, or alternatively they can be loaded from a local path.
Note that the transformer model data **is not saved with the pipeline** when you
call `nlp.to_disk`, so if you are loading pipelines in an environment with
limited internet access, make sure the model is available in your
[transformers cache directory](https://huggingface.co/docs/transformers/main/en/installation#cache-setup)
and enable offline mode if needed.
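A minimal sketch of enabling offline mode from Python, assuming the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables described in the Hugging Face documentation. The helper name `enable_offline_mode` and the pipeline path in the comment are hypothetical:

```python
import os


def enable_offline_mode() -> None:
    """Ask the Hugging Face libraries to rely on the local cache only."""
    # Assumption: these variables are typically read when the libraries are
    # imported, so set them before importing transformers or loading the
    # spaCy pipeline that contains the component.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


enable_offline_mode()

# From here on a saved pipeline can be loaded as usual, for example:
# nlp = spacy.load("/path/to/saved_pipeline")  # hypothetical path
```

Setting the same variables in the shell before starting Python should have the same effect.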
### Token classification
Config settings for `hf_token_pipe`:
```ini
[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER" # Model name or path
revision = "main" # Model revision
aggregation_strategy = "average" # "simple", "first", "average", "max"
stride = 16 # If stride >= 0, process long texts in
            # overlapping windows of the model max
            # length. The value is the length of the
            # window overlap in transformer tokenizer
            # tokens, NOT the length of the stride.
kwargs = {} # Any additional arguments for
            # TokenClassificationPipeline
alignment_mode = "strict" # "strict", "contract", "expand"
annotate = "ents" # "ents", "pos", "spans", "tag"
annotate_spans_key = null # Doc.spans key for annotate = "spans"
scorer = null # Optional scorer
```
#### `TokenClassificationPipeline` settings
- `model`: The model name or path.
- `revision`: The model revision. For production use, a specific git commit is
recommended instead of the default `main`.
- `stride`: For `stride >= 0`, the text is processed in overlapping windows
where the `stride` setting specifies the number of overlapping tokens between
windows (NOT the stride length). If `stride` is `None`, then the text may be
truncated. `stride` is only supported for fast tokenizers.
- `aggregation_strategy`: The aggregation strategy determines the word-level
tags for cases where subwords within one word do not receive the same
predicted tag. See the
[`aggregation_strategy` documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy).
- `kwargs`: Any additional arguments to
[`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline).
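
To make the difference between overlap and stride length concrete, the following standalone sketch computes window boundaries for a given overlap. It is an illustration only, not the code used by `transformers`: the function name `window_spans` is hypothetical, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
from typing import List, Tuple


def window_spans(n_tokens: int, max_length: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows.

    `overlap` plays the role of the `stride` setting: the number of tokens
    shared by two consecutive windows, not the distance between window starts.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if n_tokens <= 0:
        return []
    step = max_length - overlap  # distance between consecutive window starts
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return spans


# 10 tokens, windows of at most 4 tokens, 1 token of overlap
print(window_spans(10, 4, 1))
# [(0, 4), (3, 7), (6, 10)]
```

With an overlap of 1 and a window length of 4, each window starts 3 tokens after the previous one, so a larger `stride` value means more shared context and more windows per text.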
#### spaCy settings
- `alignment_mode` determines how transformer predictions are aligned to spaCy
token boundaries as described for
[`Doc.char_span`](https://spacy.io/api/doc#char_span).
- `annotate` and `annotate_spans_key` configure how the annotation is saved to
the spaCy doc. You can save the output as `token.tag_`, `token.pos_` (only for
UPOS tags), `doc.ents` or `doc.spans`.
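
The three alignment modes can be illustrated without spaCy. The sketch below is a simplified reimplementation of the behaviour documented for `Doc.char_span`, operating on plain character offsets; the function name `align_char_span` and the hard-coded token offsets are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def align_char_span(
    token_offsets: List[Tuple[int, int]], start: int, end: int, mode: str = "strict"
) -> Optional[Tuple[int, int]]:
    """Map a character span to a (start_token, end_token) range, or None."""
    if mode == "strict":
        # Both boundaries must coincide exactly with token boundaries.
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i for i, (_, e) in enumerate(token_offsets)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return starts[start], ends[end] + 1
        return None
    if mode == "contract":
        # Keep only tokens that lie completely inside the character span.
        hits = [i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that is at least partially covered.
        hits = [i for i, (s, e) in enumerate(token_offsets) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (hits[0], hits[-1] + 1) if hits else None


# Character offsets of the tokens in "I live in London"
offsets = [(0, 1), (2, 6), (7, 9), (10, 16)]

# A prediction covering the characters "in Lon" (7 to 13)
print(align_char_span(offsets, 7, 13, "strict"))    # None
print(align_char_span(offsets, 7, 13, "contract"))  # (2, 3)
print(align_char_span(offsets, 7, 13, "expand"))    # (2, 4)
```

With the default `"strict"` mode, a prediction whose boundaries fall inside a spaCy token is dropped, while `"contract"` and `"expand"` snap it inward or outward to the nearest token boundaries.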
#### Examples
1. Save named entity annotation as `Doc.ents`:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)
```
2. Save named entity annotation as `Doc.spans[spans_key]` and scores as
`Doc.spans[spans_key].attrs["scores"]`:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
```
3. Save fine-grained tags as `Token.tag_`:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
```
4. Save coarse-grained tags as `Token.pos_`:
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
```
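When annotations are saved to `Doc.spans` as in example 2, the printed output suggests that the scores are stored in the same order as the spans. Under that assumption, a small helper can pair them up and keep only confident predictions. The helper name `spans_above_threshold` is hypothetical, and plain strings stand in for spaCy `Span` objects so the sketch runs without downloading a model:

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def spans_above_threshold(
    spans: Sequence[T], scores: Sequence[float], threshold: float
) -> List[Tuple[T, float]]:
    """Pair each span with its score and drop pairs below the threshold."""
    if len(spans) != len(scores):
        raise ValueError("spans and scores must have the same length")
    return [(span, score) for span, score in zip(spans, scores) if score >= threshold]


# Stand-in values copied from the output of example 2
pairs = spans_above_threshold(["Sarah", "London"], [0.99854773, 0.9996215], 0.999)
print(pairs)
# [('London', 0.9996215)]
```

With a real `Doc`, the same call would take `doc.spans["bert-base-ner"]` and `doc.spans["bert-base-ner"].attrs["scores"]` as its first two arguments.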
### Text classification
Config settings for `hf_text_pipe`:
```ini
[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english" # Model name or path
revision = "main" # Model revision
kwargs = {} # Any additional arguments for
            # TextClassificationPipeline
scorer = null # Optional scorer
```
The input texts are truncated according to the transformers model max length.
#### `TextClassificationPipeline` settings
- `model`: The model name or path.
- `revision`: The model revision. For production use, a specific git commit is
recommended instead of the default `main`.
- `kwargs`: Any additional arguments to
[`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).
#### Example
```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
```
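Since `doc.cats` is a plain dictionary mapping labels to scores, the predicted label can be read off with standard Python. A minimal sketch, using the scores printed above as stand-in data; the helper name `top_category` is hypothetical:

```python
from typing import Dict, Tuple


def top_category(cats: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest-scoring label together with its score."""
    if not cats:
        raise ValueError("no categories available")
    label = max(cats, key=cats.get)
    return label, cats[label]


# Stand-in for doc.cats, copied from the example output above
cats = {"POSITIVE": 0.9998694658279419, "NEGATIVE": 0.00013048505934420973}
label, score = top_category(cats)
print(label)
# POSITIVE
```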
### Batching and GPU
Both token and text classification support batching with `nlp.pipe`:
```python
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)
```
If the component runs into an error processing a batch (e.g. on an empty text),
`nlp.pipe` will back off to processing each text individually. If it runs into
an error on an individual text, a warning is shown and the doc is returned
without additional annotation.
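
The back-off behaviour described above can be sketched in plain Python. This illustrates the control flow only and is not the package's actual implementation: `process_with_backoff`, `toy_one` and `toy_batch` are hypothetical names, and `None` stands in for a doc returned without annotation.

```python
import warnings
from typing import Callable, List, Optional, Sequence


def process_with_backoff(
    texts: Sequence[str],
    process_batch: Callable[[Sequence[str]], List[str]],
    process_one: Callable[[str], str],
) -> List[Optional[str]]:
    """Try the whole batch first, then fall back to one text at a time."""
    try:
        return list(process_batch(texts))
    except Exception:
        results: List[Optional[str]] = []
        for text in texts:
            try:
                results.append(process_one(text))
            except Exception as err:
                # Warn and keep going; None marks the unannotated item.
                warnings.warn(f"Could not annotate text: {err}")
                results.append(None)
        return results


def toy_one(text: str) -> str:
    """Toy stand-in for the model: fails on empty input."""
    if not text:
        raise ValueError("empty text")
    return text.upper()


def toy_batch(texts: Sequence[str]) -> List[str]:
    return [toy_one(text) for text in texts]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = process_with_backoff(["good", "", "fine"], toy_batch, toy_one)

print(results)
# ['GOOD', None, 'FINE']
print(len(caught))
# 1
```

A single problematic text therefore costs one retry of its batch rather than failing the whole `nlp.pipe` call.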
To run on GPU, call `spacy.require_gpu()` before loading the pipeline:
```python
import spacy
spacy.require_gpu()
for doc in nlp.pipe(texts):
    do_something(doc)
```
## Bug reports and issues
Please report bugs in the
[spaCy issue tracker](https://github.com/explosion/spaCy/issues) or open a new
thread on the [discussion board](https://github.com/explosion/spaCy/discussions)
for other issues.
Raw data
{
"_id": null,
"home_page": "https://github.com/explosion/spacy-huggingface-pipelines",
"name": "spacy-huggingface-pipelines",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "",
"author": "Explosion",
"author_email": "contact@explosion.ai",
"download_url": "https://files.pythonhosted.org/packages/38/ca/07667af54b4efb3ee204db6db6ba9a3e7d7baf59219e5c86f7888121be06/spacy_huggingface_pipelines-0.0.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "spaCy wrapper for Hugging Face Transformers pipelines",
"version": "0.0.4",
"project_urls": {
"Homepage": "https://github.com/explosion/spacy-huggingface-pipelines",
"Release notes": "https://github.com/explosion/spacy-huggingface-pipelines/releases",
"Source": "https://github.com/explosion/spacy-huggingface-pipelines"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ba691cf6333eebaadf8517f59b9dec676f42f5fef8b13a29eaf2cd2922470868",
"md5": "b52ebc695fda9cffba6ccb53b0758b5c",
"sha256": "9e38ee6eba7a11fca32b7d14f38259f7805eec211e8959105a90c95915168b00"
},
"downloads": -1,
"filename": "spacy_huggingface_pipelines-0.0.4-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "b52ebc695fda9cffba6ccb53b0758b5c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.8",
"size": 11236,
"upload_time": "2023-06-02T18:18:02",
"upload_time_iso_8601": "2023-06-02T18:18:02.363686Z",
"url": "https://files.pythonhosted.org/packages/ba/69/1cf6333eebaadf8517f59b9dec676f42f5fef8b13a29eaf2cd2922470868/spacy_huggingface_pipelines-0.0.4-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "38ca07667af54b4efb3ee204db6db6ba9a3e7d7baf59219e5c86f7888121be06",
"md5": "a967cf1c4dab40128fe57b518177cee3",
"sha256": "35b409ed7d20c5b36d788912570e3444ec1b0c344255e847bf722b3286279e95"
},
"downloads": -1,
"filename": "spacy_huggingface_pipelines-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "a967cf1c4dab40128fe57b518177cee3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 11685,
"upload_time": "2023-06-02T18:18:03",
"upload_time_iso_8601": "2023-06-02T18:18:03.859806Z",
"url": "https://files.pythonhosted.org/packages/38/ca/07667af54b4efb3ee204db6db6ba9a3e7d7baf59219e5c86f7888121be06/spacy_huggingface_pipelines-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-02 18:18:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "explosion",
"github_project": "spacy-huggingface-pipelines",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "spacy-huggingface-pipelines"
}