# spacy_crfsuite: CRF tagger for spaCy.
Sequence tagging with spaCy and crfsuite.
A port of [Rasa NLU](https://github.com/RasaHQ/rasa/blob/master/rasa/nlu/extractors/crf_entity_extractor.py).
## ✨ Features
- Simple but tough to beat **CRF entity tagger** (via [sklearn-crfsuite](https://github.com/TeamHG-Memex/sklearn-crfsuite))
- **spaCy NER component**
- **Command line interface** for training & evaluation and **example notebook**
- [CoNLL](https://www.aclweb.org/anthology/W03-0419/), JSON and [Markdown](https://rasa.com/docs/rasa/nlu/training-data-format/#id5) **annotations**
- Pre-trained NER component
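
To make the CoNLL annotation format concrete, here is a minimal sketch of reading CoNLL-2003-style text into `(token, tag)` pairs. It assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and `-DOCSTART-` document markers. The `read_conll` helper is hypothetical and is not part of spacy_crfsuite, whose own reader may behave differently.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, not taken from the library's data files.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

for sentence in read_conll(sample):
    print(sentence)
```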
## ⏳ Installation
```bash
pip install spacy_crfsuite
```
## 🚀 Quickstart
### Usage as a spaCy pipeline component
```python
import spacy
from spacy.language import Language
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03_sm.bz2")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009.")

for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:
# George Walker Bush - PER
# American - MISC
# United States - LOC
```
### Visualization (via [Gradio](https://gradio.app/named_entity_recognition/))
Run the commands below to launch a Gradio playground:
```sh
$ pip install gradio
$ python spacy_crfsuite/visualize.py
```

### Pre-trained models
You can download a pre-trained model.
| Dataset | F1 | 📥 Download |
|-------------------------------------------------------------------------------------------------------|-----|-----------------------------------------------------------------------------------------------------------------------------------|
| [CoNLL03](https://github.com/talmago/spacy_crfsuite/blob/master/examples/02%20-%20CoNLL%202003.ipynb) | 82% | [spacy_crfsuite_conll03_sm.bz2](https://github.com/talmago/spacy_crfsuite/releases/download/v1.1.0/spacy_crfsuite_conll03_sm.bz2) |
### Train your own model
The command below trains a simple model for a restaurant search bot from [Markdown annotations](https://github.com/talmago/spacy_crfsuite/blob/master/examples/restaurent_search.md) and saves it to disk.
If you prefer working in Jupyter, follow this [notebook](https://github.com/talmago/spacy_crfsuite/blob/master/examples/01%20-%20Custom%20Component.ipynb).
```sh
$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/ -lm en_core_web_sm
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy model: en_core_web_sm
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl
```
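
The Markdown training data linked above follows the Rasa convention of marking entities inline as `[text](label)`. As a rough sketch of what such an annotation encodes, the snippet below strips the markup and recovers character offsets. `parse_annotated` is a hypothetical helper written for illustration; it is not the library's loader and it ignores extensions of the format such as synonyms.

```python
import re
from typing import Dict, List, Tuple

# Matches inline annotations of the form [entity text](label).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated(line: str) -> Tuple[str, List[Dict[str, object]]]:
    """Strip [text](label) markup; return the plain text and entity spans."""
    plain: List[str] = []
    entities: List[Dict[str, object]] = []
    cursor = 0  # position in the annotated line
    length = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(line):
        before = line[cursor:match.start()]
        plain.append(before)
        length += len(before)
        value, label = match.group(1), match.group(2)
        entities.append(
            {"start": length, "end": length + len(value), "value": value, "entity": label}
        )
        plain.append(value)
        length += len(value)
        cursor = match.end()
    plain.append(line[cursor:])
    return "".join(plain), entities


text, entities = parse_annotated("show [mexican](cuisine) restaurents up [north](location)")
print(text)
for entity in entities:
    print(entity)
```

The offsets refer to the plain text with the markup removed, which is the same convention used by the entity dictionaries the extractor returns.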
The command below evaluates the CRF model and prints a classification report. This example reuses the training set for brevity; in practice, evaluate on a held-out set.
```sh
$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl -lm en_core_web_sm
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
ℹ Using spaCy model: en_core_web_sm
ℹ Classification Report:
              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
```
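
The labels in the report use BILOU prefixes: `B-` begins a multi-token entity, `I-` continues it, `L-` ends it, `U-` marks a single-token entity, and `O` is a token outside any entity. The sketch below shows how such a tag sequence maps to token spans. `bilou_to_spans` is a hypothetical helper, not a library function, and for brevity it does not check that the labels inside one span agree.

```python
from typing import List, Optional, Tuple


def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BILOU tags to (start_token, end_token_exclusive, label) spans."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None  # index where the open multi-token span began
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Single-token entity.
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            # Close the span opened by the matching B- tag.
            spans.append((start, i + 1, label))
            start = None
        # I- tags simply continue the currently open span.
    return spans


tags = ["O", "O", "B-cuisine", "L-cuisine", "O", "O", "O", "U-location"]
print(bilou_to_spans(tags))
```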
Now we can use the tagger for named entity recognition in a spaCy pipeline!
```python
import spacy
from spacy.language import Language
from spacy_crfsuite import CRFEntityExtractor, CRFExtractor


@Language.factory("ner_crf")
def create_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("model/model.pkl")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner_crf")

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location
```
Alternatively, use it as a standalone component:
```python
from spacy_crfsuite import CRFExtractor
from spacy_crfsuite.tokenizer import SpacyTokenizer
crf_extractor = CRFExtractor().from_disk("model/model.pkl")
tokenizer = SpacyTokenizer()
example = {"text": "show mexican restaurents up north"}
tokenizer.tokenize(example, attribute="text")
crf_extractor.process(example)
# Output:
# [{'start': 5,
# 'end': 12,
# 'value': 'mexican',
# 'entity': 'cuisine',
# 'confidence': 0.5823148506311286},
# {'start': 28,
# 'end': 33,
# 'value': 'north',
# 'entity': 'location',
# 'confidence': 0.8863076478494413}]
```
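
Because each returned entity carries a `confidence` score, a caller may want to drop low-confidence predictions. The sketch below does this over a list shaped like the output above. The `filter_by_confidence` helper and the `0.6` threshold are illustrative assumptions, not part of the library.

```python
from typing import Dict, List

# Entities copied from the example output above.
entities = [
    {"start": 5, "end": 12, "value": "mexican", "entity": "cuisine",
     "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north", "entity": "location",
     "confidence": 0.8863076478494413},
]


def filter_by_confidence(items: List[Dict], threshold: float) -> List[Dict]:
    """Keep only entities whose confidence is at least `threshold`."""
    return [item for item in items if item["confidence"] >= threshold]


# Hypothetical threshold; tune it on a held-out set.
for entity in filter_by_confidence(entities, threshold=0.6):
    print(entity["value"], "--", entity["entity"])
```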
We can also inspect what the model learned.
Use the `.explain()` method to see the transitions and features that drive its decisions.
```python
print(crf_extractor.explain())
# Output:
#
# Most likely transitions:
# O -> O 1.637338
# B-cuisine -> I-cuisine 1.373766
# U-cuisine -> O 1.306077
# I-cuisine -> L-cuisine 0.915989
# O -> U-location 0.751463
# B-cuisine -> L-cuisine 0.698893
# O -> U-cuisine 0.480360
# U-location -> U-cuisine 0.403487
# O -> B-cuisine 0.261450
# L-cuisine -> O 0.182695
#
# Positive features:
# 1.976502 O 0:bias:bias
# 1.957180 U-location -1:low:the
# 1.216547 B-cuisine -1:low:for
# 1.153924 U-location 0:prefix5:centr
# 1.153924 U-location 0:prefix2:ce
# 1.110536 U-location 0:digit
# 1.058294 U-cuisine 0:prefix5:chine
# 1.058294 U-cuisine 0:prefix2:ch
# 1.051457 U-cuisine 0:suffix2:an
# 0.999976 U-cuisine -1:low:me
```
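
In a linear-chain CRF, the score of a tag sequence is the sum of its state-feature weights and its transition weights. To show how the transition weights listed above contribute, the sketch below adds them up for a candidate tag sequence. It is a simplified illustration: state features are left out, `transition_score` is a hypothetical helper, and transitions not listed in the output are treated as `0.0`, which is an assumption (the trained model holds its own weights for them).

```python
from typing import Dict, List, Tuple

# Transition weights copied from the explain() output above.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    ("O", "O"): 1.637338,
    ("B-cuisine", "I-cuisine"): 1.373766,
    ("U-cuisine", "O"): 1.306077,
    ("I-cuisine", "L-cuisine"): 0.915989,
    ("O", "U-location"): 0.751463,
    ("B-cuisine", "L-cuisine"): 0.698893,
    ("O", "U-cuisine"): 0.480360,
    ("U-location", "U-cuisine"): 0.403487,
    ("O", "B-cuisine"): 0.261450,
    ("L-cuisine", "O"): 0.182695,
}


def transition_score(tags: List[str]) -> float:
    """Sum the transition weights over consecutive tag pairs."""
    # Assumption: unlisted transitions contribute 0.0 in this sketch.
    return sum(TRANSITIONS.get(pair, 0.0) for pair in zip(tags, tags[1:]))


for candidate in (["O", "U-cuisine", "O"], ["O", "B-cuisine", "L-cuisine", "O"]):
    print(candidate, f"{transition_score(candidate):.6f}")
```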
> **Notice**: You can also access the `crf_extractor` directly with `nlp.get_pipe("ner_crf").crf_extractor`.
### Deploy to a web server
Start a web service:
```sh
$ pip install uvicorn
$ uvicorn spacy_crfsuite.serve:app --host 127.0.0.1 --port 5000
```
> **Notice**: Set `$SPACY_MODEL` and `$CRF_MODEL` in your environment to configure the server.

cURL example:
```sh
$ curl -X POST http://127.0.0.1:5000/parse -H 'Content-Type: application/json' -d '{"text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009."}'
{
"data": [
{
"text": "George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009.",
"entities": [
{
"start": 0,
"end": 18,
"value": "George Walker Bush",
"entity": "PER"
},
{
"start": 45,
"end": 53,
"value": "American",
"entity": "MISC"
},
{
"start": 121,
"end": 134,
"value": "United States",
"entity": "LOC"
}
]
}
]
}
```
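
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `127.0.0.1:5000` and returns the JSON shape shown in the cURL example. The `build_request`, `extract_entities` and `parse` helpers are hypothetical names introduced for illustration.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

PARSE_URL = "http://127.0.0.1:5000/parse"  # endpoint from the cURL example


def build_request(text: str, url: str = PARSE_URL) -> urllib.request.Request:
    """Build a JSON POST request for the /parse endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def extract_entities(payload: Dict) -> List[Tuple[str, str]]:
    """Flatten the response into (value, entity) pairs."""
    return [
        (entity["value"], entity["entity"])
        for item in payload["data"]
        for entity in item["entities"]
    ]


def parse(text: str, url: str = PARSE_URL) -> List[Tuple[str, str]]:
    """Send `text` to the running server and return its entities."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return extract_entities(json.load(response))


# With the server running:
# print(parse("George Walker Bush served as president of the United States."))
```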
## Development
Set up the environment:
```sh
$ poetry install
$ poetry run spacy download en_core_web_sm
```
Run the unit tests:
```sh
$ poetry run pytest
```
Run black (code formatting):
```sh
$ poetry run black spacy_crfsuite/ --config=pyproject.toml
```