| Name | spacy-stanza |
| --- | --- |
| Version | 1.0.4 |
| home_page | https://explosion.ai |
| Summary | Use the latest Stanza (StanfordNLP) research models directly in spaCy |
| upload_time | 2023-10-09 07:10:26 |
| maintainer | |
| docs_url | None |
| author | Explosion |
| requires_python | >=3.6 |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy + Stanza (formerly StanfordNLP)
This package wraps the [Stanza](https://github.com/stanfordnlp/stanza) (formerly
StanfordNLP) library, so you can use Stanford's models in a
[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy in
the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech
tagging, morphological analysis, lemmatization and labeled dependency parsing in
68 languages. As of v1.0, Stanza also supports named entity recognition for
selected languages.
> ⚠️ Previous versions of this package were available as
> [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).
[![tests](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml)
[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)
[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
Using this wrapper, you'll be able to use the following annotations, computed by
your pretrained `stanza` model:
- Statistical tokenization (reflected in the `Doc` and its tokens)
- Lemmatization (`token.lemma` and `token.lemma_`)
- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)
- Morphological analysis (`token.morph`)
- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)
- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`,
`token.ent_iob`, `token.ent_iob_`)
- Sentence segmentation (`doc.sents`)
## ⌛️ Installation

As of v1.0.0, `spacy-stanza` is only compatible with **spaCy v3.x**. To install
the most recent version:
```bash
pip install spacy-stanza
```
For spaCy v2, install v0.2.x and refer to the
[v0.2.x usage documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):
```bash
pip install "spacy-stanza<0.3.0"
```
Make sure to also
[download](https://stanfordnlp.github.io/stanza/download_models.html) one of the
[pre-trained Stanza models](https://stanfordnlp.github.io/stanza/models.html).
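If you want to fetch a model ahead of time, a one-off download is enough. This is the same `stanza.download()` call used in the usage example below; pick the language code of the model you plan to load:

```python
# Fetch the English model once; swap "en" for any other supported language code.
import stanza

stanza.download("en")
```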
## 📖 Usage & Examples
> ⚠️ **Important note:** This package has been refactored to take advantage of
> [spaCy v3.0](https://spacy.io). Previous versions that were built for
> [spaCy v2.x](https://v2.spacy.io) worked considerably differently. Please see
> previous tagged versions of this README for documentation on prior versions.
Use `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to
process a text with a Stanza pipeline and create a spaCy
[`Doc` object](https://spacy.io/api/doc). By default, both the spaCy pipeline
and the Stanza pipeline will be initialized with the same `lang`, e.g. "en":
```python
import stanza
import spacy_stanza
# Download the stanza model if necessary
stanza.download("en")
# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```
If language data for the given language is available in spaCy, the respective
language class can be used as the base for the `nlp` object – for example,
`English()`. This lets you use spaCy's lexical attributes like `is_stop` or
`like_num`. The `nlp` object follows the same API as any other spaCy `Language`
class – so you can visualize the `Doc` objects with displaCy, add custom
components to the pipeline, use the rule-based matcher and do pretty much
anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])
# Visualize dependencies
from spacy import displacy
displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook
# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)
# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc
nlp.add_pipe("custom_component")
doc = nlp("Some text")
# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
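The rule-based matcher mentioned above works on these docs as well. A minimal sketch (not part of the original examples; the pattern and text are purely illustrative):

```python
import spacy_stanza
from spacy.matcher import Matcher

nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Barack Obama was born in Hawaii.")

# Match the literal sequence "born in" followed by a title-cased token
matcher = Matcher(nlp.vocab)
matcher.add("BORN_IN", [[{"LOWER": "born"}, {"LOWER": "in"}, {"IS_TITLE": True}]])

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```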
### Stanza Pipeline options
Additional options for the Stanza
[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be
provided as keyword arguments following the `Pipeline` API:
- Provide the Stanza language as `lang`. For Stanza languages without spaCy
support, use "xx" for the spaCy language setting:
```python
# Initialize a pipeline for Coptic
nlp = spacy_stanza.load_pipeline("xx", lang="cop")
```
- Provide Stanza pipeline settings following the `Pipeline` API:
```python
# Initialize a German pipeline with the `hdt` package
nlp = spacy_stanza.load_pipeline("de", package="hdt")
```
- Tokenize with spaCy rather than the statistical tokenizer (only for English):
```python
nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})
```
- Provide any additional processor settings as additional keyword arguments:
```python
# Provide pretokenized texts (whitespace tokenization)
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
```
The spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`
block. For example, here is the config for the last example above, a German
pipeline with pretokenized texts:
```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true
[nlp.tokenizer.kwargs]
tokenize_pretokenized = true
[nlp.tokenizer.processors]
```
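To check what ends up in that block for a given set of options, you can inspect the generated config on the `nlp` object. A small sketch, not from the original README (it assumes the German Stanza model has already been downloaded):

```python
import spacy_stanza

# Recreate the last example above and print the serialized tokenizer settings
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
print(nlp.config["nlp"]["tokenizer"])
```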
### Serialization
The full Stanza pipeline configuration is stored in the spaCy pipeline
[config](https://spacy.io/usage/training#config), so you can save and load the
pipeline just like any other `nlp` pipeline:
```python
# Save to a local directory
nlp.to_disk("./stanza-spacy-model")
# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```
Note that this **does not save any Stanza model data by default**. The Stanza
models are very large, so for now, this package expects you to download the
models separately with `stanza.download()` and have them available either in the
default model directory or in the path specified under `[nlp.tokenizer.dir]` in
the config.
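For example, one way to keep the models in a project-local directory is to download them there and point the pipeline at the same path. This is a hedged sketch: `model_dir` follows the `stanza.download()` API and `dir` the Stanza `Pipeline` API; the path itself is just an illustration.

```python
import stanza
import spacy_stanza

MODEL_DIR = "./stanza_resources"  # illustrative local path

# Download the English model into the custom directory, then load from it
stanza.download("en", model_dir=MODEL_DIR)
nlp = spacy_stanza.load_pipeline("en", dir=MODEL_DIR)
```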
### Adding additional spaCy pipeline components
By default, the spaCy pipeline in the `nlp` object returned by
`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes
are computed and set within the custom tokenizer,
[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`
object, you can add your own components to the pipeline. For example, you could
add
[your own custom text classification component](https://spacy.io/usage/training)
with `nlp.add_pipe("textcat", source=source_nlp)`, or augment the named entities
with your own rule-based patterns using the
[`EntityRuler` component](https://spacy.io/usage/rule-based-matching#entityruler).
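As an illustration, an `EntityRuler` added on top of the Stanza entities might look like this (a minimal sketch; the pattern and text are made up):

```python
import spacy_stanza

nlp = spacy_stanza.load_pipeline("en")

# Add a rule-based pattern on top of the statistical entities
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion is a software company based in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```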
## Raw data

```json
{
"_id": null,
"home_page": "https://explosion.ai",
"name": "spacy-stanza",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Explosion",
"author_email": "contact@explosion.ai",
"download_url": "https://files.pythonhosted.org/packages/6f/c9/8d4183c5064d99ecf214d59aa1102c2d32368c198e17db6fa913a67fbaef/spacy-stanza-1.0.4.tar.gz",
"platform": null,
"description": "<a href=\"https://explosion.ai\"><img src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /></a>\n\n# spaCy + Stanza (formerly StanfordNLP)\n\nThis package wraps the [Stanza](https://github.com/stanfordnlp/stanza) (formerly\nStanfordNLP) library, so you can use Stanford's models in a\n[spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy in\nthe CoNLL 2017 and 2018 shared task, which involves tokenization, part-of-speech\ntagging, morphological analysis, lemmatization and labeled dependency parsing in\n68 languages. As of v1.0, Stanza also supports named entity recognition for\nselected languages.\n\n> \u26a0\ufe0f Previous version of this package were available as\n> [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp).\n\n[![tests](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml)\n[![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza)\n[![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n\nUsing this wrapper, you'll be able to use the following annotations, computed by\nyour pretrained `stanza` model:\n\n- Statistical tokenization (reflected in the `Doc` and its tokens)\n- Lemmatization (`token.lemma` and `token.lemma_`)\n- Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`)\n- Morphological analysis (`token.morph`)\n- Dependency parsing (`token.dep`, `token.dep_`, `token.head`)\n- Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`,\n `token.ent_iob`, `token.ent_iob_`)\n- Sentence segmentation (`doc.sents`)\n\n## \ufe0f\ufe0f\ufe0f\u231b\ufe0f Installation\n\nAs of v1.0.0 `spacy-stanza` is only compatible with **spaCy v3.x**. To install\nthe most recent version:\n\n```bash\npip install spacy-stanza\n```\n\nFor spaCy v2, install v0.2.x and refer to the\n[v0.2.x usage documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples):\n\n```bash\npip install \"spacy-stanza<0.3.0\"\n```\n\nMake sure to also\n[download](https://stanfordnlp.github.io/stanza/download_models.html) one of the\n[pre-trained Stanza models](https://stanfordnlp.github.io/stanza/models.html).\n\n## \ud83d\udcd6 Usage & Examples\n\n> \u26a0\ufe0f **Important note:** This package has been refactored to take advantage of\n> [spaCy v3.0](https://spacy.io). Previous versions that were built for\n> [spaCy v2.x](https://v2.spacy.io) worked considerably differently. Please see\n> previous tagged versions of this README for documentation on prior versions.\n\nUse `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to\nprocess a text with a Stanza pipeline and create a spaCy\n[`Doc` object](https://spacy.io/api/doc). By default, both the spaCy pipeline\nand the Stanza pipeline will be initialized with the same `lang`, e.g. \"en\":\n\n```python\nimport stanza\nimport spacy_stanza\n\n# Download the stanza model if necessary\nstanza.download(\"en\")\n\n# Initialize the pipeline\nnlp = spacy_stanza.load_pipeline(\"en\")\n\ndoc = nlp(\"Barack Obama was born in Hawaii. 
He was elected president in 2008.\")\nfor token in doc:\n print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)\nprint(doc.ents)\n```\n\nIf language data for the given language is available in spaCy, the respective\nlanguage class can be used as the base for the `nlp` object \u2013 for example,\n`English()`. This lets you use spaCy's lexical attributes like `is_stop` or\n`like_num`. The `nlp` object follows the same API as any other spaCy `Language`\nclass \u2013 so you can visualize the `Doc` objects with displaCy, add custom\ncomponents to the pipeline, use the rule-based matcher and do pretty much\nanything else you'd normally do in spaCy.\n\n```python\n# Access spaCy's lexical attributes\nprint([token.is_stop for token in doc])\nprint([token.like_num for token in doc])\n\n# Visualize dependencies\nfrom spacy import displacy\ndisplacy.serve(doc) # or displacy.render if you're in a Jupyter notebook\n\n# Process texts with nlp.pipe\nfor doc in nlp.pipe([\"Lots of texts\", \"Even more texts\", \"...\"]):\n print(doc.text)\n\n# Combine with your own custom pipeline components\nfrom spacy import Language\n@Language.component(\"custom_component\")\ndef custom_component(doc):\n # Do something to the doc here\n print(f\"Custom component called: {doc.text}\")\n return doc\n\nnlp.add_pipe(\"custom_component\")\ndoc = nlp(\"Some text\")\n\n# Serialize attributes to a numpy array\nnp_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])\n```\n\n### Stanza Pipeline options\n\nAdditional options for the Stanza\n[`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be\nprovided as keyword arguments following the `Pipeline` API:\n\n- Provide the Stanza language as `lang`. For Stanza languages without spaCy\n support, use \"xx\" for the spaCy language setting:\n\n ```python\n # Initialize a pipeline for Coptic\n nlp = spacy_stanza.load_pipeline(\"xx\", lang=\"cop\")\n ```\n\n- Provide Stanza pipeline settings following the `Pipeline` API:\n\n ```python\n # Initialize a German pipeline with the `hdt` package\n nlp = spacy_stanza.load_pipeline(\"de\", package=\"hdt\")\n ```\n\n- Tokenize with spaCy rather than the statistical tokenizer (only for English):\n\n ```python\n nlp = spacy_stanza.load_pipeline(\"en\", processors= {\"tokenize\": \"spacy\"})\n ```\n\n- Provide any additional processor settings as additional keyword arguments:\n\n ```python\n # Provide pretokenized texts (whitespace tokenization)\n nlp = spacy_stanza.load_pipeline(\"de\", tokenize_pretokenized=True)\n ```\n\nThe spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]`\nblock. For example, the config for the last example above, a German pipeline\nwith pretokenized texts:\n\n```ini\n[nlp.tokenizer]\n@tokenizers = \"spacy_stanza.PipelineAsTokenizer.v1\"\nlang = \"de\"\ndir = null\npackage = \"default\"\nlogging_level = null\nverbose = null\nuse_gpu = true\n\n[nlp.tokenizer.kwargs]\ntokenize_pretokenized = true\n\n[nlp.tokenizer.processors]\n```\n\n### Serialization\n\nThe full Stanza pipeline configuration is stored in the spaCy pipeline\n[config](https://spacy.io/usage/training#config), so you can save and load the\npipeline just like any other `nlp` pipeline:\n\n```python\n# Save to a local directory\nnlp.to_disk(\"./stanza-spacy-model\")\n\n# Reload the pipeline\nnlp = spacy.load(\"./stanza-spacy-model\")\n```\n\nNote that this **does not save any Stanza model data by default**. 
The Stanza\nmodels are very large, so for now, this package expects you to download the\nmodels separately with `stanza.download()` and have them available either in the\ndefault model directory or in the path specified under `[nlp.tokenizer.dir]` in\nthe config.\n\n### Adding additional spaCy pipeline components\n\nBy default, the spaCy pipeline in the `nlp` object returned by\n`spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes\nare computed and set within the custom tokenizer,\n[`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp`\nobject, you can add your own components to the pipeline. For example, you could\nadd\n[your own custom text classification component](https://spacy.io/usage/training)\nwith `nlp.add_pipe(\"textcat\", source=source_nlp)`, or augment the named entities\nwith your own rule-based patterns using the\n[`EntityRuler` component](https://spacy.io/usage/rule-based-matching#entityruler).\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Use the latest Stanza (StanfordNLP) research models directly in spaCy",
"version": "1.0.4",
"project_urls": {
"Homepage": "https://explosion.ai",
"Release notes": "https://github.com/explosion/spacy-stanza/releases",
"Source": "https://github.com/explosion/spacy-stanza"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dde1d38eff51089bba011c13cf181c2d0e8358c10e00752f397fd27147ad8e6e",
"md5": "b83f4012da4995b7101dde2655c2ca55",
"sha256": "1ffb2f0053bb1361adf49c08fcac77214c2240770002239b1a715be3e25c6c99"
},
"downloads": -1,
"filename": "spacy_stanza-1.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b83f4012da4995b7101dde2655c2ca55",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 9723,
"upload_time": "2023-10-09T07:10:24",
"upload_time_iso_8601": "2023-10-09T07:10:24.732623Z",
"url": "https://files.pythonhosted.org/packages/dd/e1/d38eff51089bba011c13cf181c2d0e8358c10e00752f397fd27147ad8e6e/spacy_stanza-1.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6fc98d4183c5064d99ecf214d59aa1102c2d32368c198e17db6fa913a67fbaef",
"md5": "02195fdb1654e0b178fdfa49f40ad22b",
"sha256": "4236d37dcd9342b6e1a1d4d86cb840c01a31ab3c48a5052ea4b6cf8abcc4f334"
},
"downloads": -1,
"filename": "spacy-stanza-1.0.4.tar.gz",
"has_sig": false,
"md5_digest": "02195fdb1654e0b178fdfa49f40ad22b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 13859,
"upload_time": "2023-10-09T07:10:26",
"upload_time_iso_8601": "2023-10-09T07:10:26.297008Z",
"url": "https://files.pythonhosted.org/packages/6f/c9/8d4183c5064d99ecf214d59aa1102c2d32368c198e17db6fa913a67fbaef/spacy-stanza-1.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-09 07:10:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "explosion",
"github_project": "spacy-stanza",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "spacy-stanza"
}
```