spacy-huggingface-pipelines

| Field | Value |
| --- | --- |
| Name | spacy-huggingface-pipelines |
| Version | 0.0.4 |
| home_page | https://github.com/explosion/spacy-huggingface-pipelines |
| Summary | spaCy wrapper for Hugging Face Transformers pipelines |
| upload_time | 2023-06-02 18:18:03 |
| author | Explosion |
| requires_python | >=3.8 |
| license | MIT |

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# spacy-huggingface-pipelines: Use pretrained transformer models for text and token classification

This package provides [spaCy](https://github.com/explosion/spaCy) components to
use pretrained
[Hugging Face Transformers pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)
for inference only.

[![PyPi](https://img.shields.io/pypi/v/spacy-huggingface-pipelines.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/spacy-huggingface-pipelines)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-huggingface-pipelines/all.svg?style=flat-square&logo=github)](https://github.com/explosion/spacy-huggingface-pipelines/releases)

## Features

- Apply pretrained transformers models like
  [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) and
  [`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

## 🚀 Installation

Installing the package from pip will automatically install all dependencies,
including PyTorch and spaCy.

```bash
pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines
```

For GPU installation, follow the
[spaCy installation quickstart with GPU](https://spacy.io/usage/), for example:

```bash
pip install -U "spacy[cuda-autodetect]"
```

If you are having trouble installing PyTorch, follow the
[instructions](https://pytorch.org/get-started/locally/) on the official website
for your specific operating system and requirements.

## 📖 Documentation

This module provides spaCy wrappers for the inference-only transformers
[`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TokenClassificationPipeline)
and
[`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextClassificationPipeline)
pipelines.

The models are downloaded on initialization from the
[Hugging Face Hub](https://huggingface.co/models) if they're not already in your
local cache, or alternatively they can be loaded from a local path.

Note that the transformer model data **is not saved with the pipeline** when you
call `nlp.to_disk`, so if you are loading pipelines in an environment with
limited internet access, make sure the model is available in your
[transformers cache directory](https://huggingface.co/docs/transformers/main/en/installation#cache-setup)
and enable offline mode if needed.
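
As a minimal sketch of enabling offline mode from Python: the Hugging Face
libraries read the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment
variables, so they need to be set before `transformers` is imported. The helper
name `enable_hf_offline_mode` and the pipeline path in the comments are
hypothetical.

```python
import os


def enable_hf_offline_mode():
    """Ask the Hugging Face libraries to use only the local cache."""
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


# Call this before importing spaCy components that load transformers.
enable_hf_offline_mode()

# Afterwards, load the pipeline as usual (hypothetical path):
# import spacy
# nlp = spacy.load("/path/to/saved_pipeline")
```

With these variables set, a model that is missing from the cache should raise an
error instead of triggering a download.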

### Token classification

Config settings for `hf_token_pipe`:

```ini
[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER"     # Model name or path
revision = "main"                 # Model revision
aggregation_strategy = "average"  # "simple", "first", "average", "max"
stride = 16                       # If stride >= 0, process long texts in
                                  # overlapping windows of the model max
                                  # length. The value is the length of the
                                  # window overlap in transformer tokenizer
                                  # tokens, NOT the length of the stride.
kwargs = {}                       # Any additional arguments for
                                  # TokenClassificationPipeline
alignment_mode = "strict"         # "strict", "contract", "expand"
annotate = "ents"                 # "ents", "pos", "spans", "tag"
annotate_spans_key = null         # Doc.spans key for annotate = "spans"
scorer = null                     # Optional scorer
```
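
The `stride` comment above is easy to misread, so here is a simplified sketch of
how overlapping windows could be laid out when the setting is the overlap rather
than the step size. It is plain Python and not the tokenizer's actual
implementation: `overlapping_windows` is a hypothetical helper, and special
tokens are ignored.

```python
def overlapping_windows(n_tokens, max_length, overlap):
    """Return (start, end) token offsets of windows that share `overlap` tokens."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap  # the actual stride length
    windows = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += step
    return windows


# 10 tokens, windows of 4 tokens, 1 token shared between neighbours
print(overlapping_windows(10, 4, 1))
# [(0, 4), (3, 7), (6, 10)]
```

In this toy example the windows advance by 3 tokens, while the configured value
(`overlap=1`) is the number of tokens that neighbouring windows have in common.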

#### `TokenClassificationPipeline` settings

- `model`: The model name or path.
- `revision`: The model revision. For production use, a specific git commit is
  recommended instead of the default `main`.
- `stride`: For `stride >= 0`, the text is processed in overlapping windows, and
  `stride` specifies the number of tokens shared between consecutive windows
  (NOT the stride length). If `stride` is `None` (`null` in the config), the
  text may be truncated. `stride` is only supported for fast tokenizers.
- `aggregation_strategy`: The aggregation strategy determines the word-level
  tags for cases where subwords within one word do not receive the same
  predicted tag. See the
  [`aggregation_strategy` documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy).
- `kwargs`: Any additional arguments to
  [`TokenClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline).
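
To illustrate what the `aggregation_strategy` setting decides, the following
sketch resolves one word whose subwords disagree, using the `"first"`,
`"average"` and `"max"` strategies as they are described in the transformers
documentation. It is a simplified illustration, not the library's code:
`aggregate_word` and the scores are hypothetical, and `"simple"` is not covered.

```python
from collections import defaultdict


def aggregate_word(subword_scores, strategy="average"):
    """Return (label, score) for one word from per-subword label scores."""
    if strategy == "first":
        # Use the prediction of the first subword only.
        scores = subword_scores[0]
    elif strategy == "max":
        # Use the subword whose best label has the highest score.
        scores = max(subword_scores, key=lambda s: max(s.values()))
    elif strategy == "average":
        # Average each label's score over all subwords, then pick the best.
        totals = defaultdict(float)
        for s in subword_scores:
            for label, score in s.items():
                totals[label] += score
        scores = {label: t / len(subword_scores) for label, t in totals.items()}
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical scores for one word split into two subwords
subwords = [
    {"B-PER": 0.6, "O": 0.4},
    {"B-PER": 0.1, "O": 0.9},
]
for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(subwords, strategy)[0])
```

Here `"first"` yields `B-PER`, while `"average"` and `"max"` yield `O`, which is
why the choice of strategy can change the final annotation.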

#### spaCy settings

- `alignment_mode` determines how transformer predictions are aligned to spaCy
  token boundaries as described for
  [`Doc.char_span`](https://spacy.io/api/doc#char_span).
- `annotate` and `annotate_spans_key` configure how the annotation is saved to
  the spaCy doc. You can save the output as `token.tag_`, `token.pos_` (only for
  UPOS tags), `doc.ents` or `doc.spans`.
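
The three `alignment_mode` values can be sketched in plain Python over token
character offsets. This follows the behaviour described for `Doc.char_span`
(`"strict"`: exact token boundaries only, `"contract"`: tokens fully inside the
span, `"expand"`: tokens at least partly covered), but `align_char_span` is a
hypothetical helper and not spaCy's implementation.

```python
def align_char_span(token_offsets, start_char, end_char, mode="strict"):
    """Map a character span to a (start, end) token range, or None."""
    if mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i for i, (_, e) in enumerate(token_offsets)}
        if start_char in starts and end_char in ends:
            first, last = starts[start_char], ends[end_char]
            if first <= last:
                return (first, last + 1)
        return None
    if mode == "contract":
        # Keep only tokens that lie completely inside the character span.
        covered = [i for i, (s, e) in enumerate(token_offsets)
                   if s >= start_char and e <= end_char]
    elif mode == "expand":
        # Keep every token that overlaps the character span at all.
        covered = [i for i, (s, e) in enumerate(token_offsets)
                   if s < end_char and e > start_char]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (covered[0], covered[-1] + 1) if covered else None


# Character offsets of the tokens in "I live in London"
tokens = [(0, 1), (2, 6), (7, 9), (10, 16)]
# A prediction covering "n London" (characters 8 to 16)
for mode in ("strict", "contract", "expand"):
    print(mode, align_char_span(tokens, 8, 16, mode))
# strict None
# contract (3, 4)
# expand (2, 4)
```

With `"strict"`, a prediction that does not line up with spaCy's tokens produces
no annotation, whereas `"contract"` and `"expand"` shrink or grow it to the
nearest token boundaries.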

#### Examples

1. Save named entity annotation as `Doc.ents`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)
```

2. Save named entity annotation as `Doc.spans[spans_key]` and scores as
   `Doc.spans[spans_key].attrs["scores"]`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
```
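
Since the scores are stored as a list parallel to the spans, they can be used to
keep only confident predictions. The sketch below assumes that one-to-one
ordering, as suggested by the output above; `filter_by_score` is a hypothetical
helper, and plain strings stand in for spaCy `Span` objects so that the snippet
runs without downloading a model.

```python
def filter_by_score(spans, scores, threshold):
    """Keep the spans whose parallel score is at least `threshold`."""
    return [span for span, score in zip(spans, scores) if score >= threshold]


# Stand-ins for doc.spans["bert-base-ner"] and its attrs["scores"]
spans = ["Sarah", "London"]
scores = [0.99854773, 0.9996215]
print(filter_by_score(spans, scores, 0.999))
# ['London']
```

With a real doc, the same helper could be called as
`filter_by_score(group, group.attrs["scores"], 0.999)` where
`group = doc.spans["bert-base-ner"]`.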

3. Save fine-grained tags as `Token.tag`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
```

4. Save coarse-grained tags as `Token.pos`:

```python
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
```

### Text classification

Config settings for `hf_text_pipe`:

```ini
[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english"  # Model name or path
revision = "main"                 # Model revision
kwargs = {}                       # Any additional arguments for
                                  # TextClassificationPipeline
scorer = null                     # Optional scorer
```

Input texts are truncated to the transformer model's maximum length.

#### `TextClassificationPipeline` settings

- `model`: The model name or path.
- `revision`: The model revision. For production use, a specific git commit is
  recommended instead of the default `main`.
- `kwargs`: Any additional arguments to
  [`TextClassificationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).

#### Example

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
```
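
`doc.cats` maps each label to a score, so the predicted label is the key with
the highest value. A small sketch, with the hypothetical helper `top_category`
and the dictionary from the example output above:

```python
def top_category(cats):
    """Return (label, score) for the highest-scoring category, or None."""
    if not cats:
        # doc.cats can be empty, e.g. if the doc could not be annotated
        return None
    label = max(cats, key=cats.get)
    return label, cats[label]


cats = {"POSITIVE": 0.9998694658279419, "NEGATIVE": 0.00013048505934420973}
print(top_category(cats))
# ('POSITIVE', 0.9998694658279419)
```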

### Batching and GPU

Both token and text classification support batching with `nlp.pipe`:

```python
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)
```

If the component runs into an error processing a batch (e.g. on an empty text),
`nlp.pipe` will back off to processing each text individually. If it runs into
an error on an individual text, a warning is shown and the doc is returned
without additional annotation.
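
The back-off behaviour described above can be sketched as follows. This is an
illustration of the general pattern only, not the package's code:
`annotate_with_backoff`, `annotate_batch` and `toy_annotate` are hypothetical
names, and `None` stands in for a doc returned without annotation.

```python
import warnings


def annotate_with_backoff(texts, annotate_batch):
    """Try the whole batch first, then fall back to one text at a time."""
    try:
        return annotate_batch(texts)
    except Exception:
        pass  # fall through to per-text processing
    results = []
    for text in texts:
        try:
            results.append(annotate_batch([text])[0])
        except Exception as err:
            warnings.warn(f"Could not annotate {text!r}: {err}")
            results.append(None)  # unannotated placeholder
    return results


def toy_annotate(batch):
    """Toy annotator that fails on empty texts, like the example above."""
    if any(not text for text in batch):
        raise ValueError("empty text")
    return [text.upper() for text in batch]


print(annotate_with_backoff(["a", "", "b"], toy_annotate))
# ['A', None, 'B'] (plus a warning for the empty text)
```

One failing text therefore costs a second pass over its batch, but the other
texts in that batch still get annotated.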

Switch to GPU:

```python
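import spacy

# Call require_gpu() right after importing spaCy, before creating the pipeline.
spacy.require_gpu()

nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
for doc in nlp.pipe(texts):
    do_something(doc)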
```

## Bug reports and issues

Please report bugs in the
[spaCy issue tracker](https://github.com/explosion/spaCy/issues) or open a new
thread on the [discussion board](https://github.com/explosion/spaCy/discussions)
for other issues.
