**span-marker 1.7.0** (PyPI package metadata)

| Field | Value |
| ------------- | ------------- |
| Name | `span-marker` |
| Version | 1.7.0 |
| Summary | Named Entity Recognition using Span Markers |
| Author / Maintainer | Tom Aarsen |
| License | Apache-2.0 |
| Requires Python | >=3.9 |
| Upload time | 2025-01-08 11:39:11 |
| Keywords | data-science, natural-language-processing, artificial-intelligence, mlops, nlp, machine-learning, transformers |
            <div align="center">
<h1>
SpanMarker for Named Entity Recognition
</h1>
<a href="https://huggingface.co/tomaarsen/span-marker-roberta-large-ontonotes5" target="_blank">
    <img src="https://github.com/tomaarsen/SpanMarkerNER/assets/37621491/c76d6393-bb0b-44c3-9412-fd9c8313dcc1">
</a>

[🤗 Models](https://huggingface.co/models?library=span-marker) |
[🛠️ Getting Started In Google Colab](https://colab.research.google.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb) |
[📄 Documentation](https://tomaarsen.github.io/SpanMarkerNER) | 📊 [Thesis](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
</div>

SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and ELECTRA.
Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) library, SpanMarker inherits a wide range of powerful functionality, such as easy loading and saving of models, hyperparameter optimization, automatic logging to various tools, checkpointing, callbacks, mixed precision training, 8-bit inference, and more.

Based on the [PL-Marker](https://arxiv.org/pdf/2109.06067.pdf) paper, SpanMarker breaks the mold through its accessibility and ease of use. Crucially, SpanMarker works out of the box with many common encoders such as `bert-base-cased`, `roberta-large` and `bert-base-multilingual-cased`, and automatically supports datasets annotated with the `IOB`, `IOB2`, `BIOES` or `BILOU` scheme, as well as datasets without any labeling scheme.
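
As a minimal sketch of that out-of-the-box behavior, here is how a model could be initialized from an IOB2-annotated dataset. The public `conll2003` dataset is used purely for illustration (loading details may vary across `datasets` versions), and any of the common encoders above can be dropped in:

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel

# conll2003 annotates its "ner_tags" column with the IOB2 scheme;
# SpanMarker converts such schemes to span-level labels automatically.
dataset = load_dataset("conll2003")
labels = dataset["train"].features["ner_tags"].feature.names
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Any familiar encoder works here, e.g. bert-base-cased or roberta-large.
model = SpanMarkerModel.from_pretrained("bert-base-multilingual-cased", labels=labels)
```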

Additionally, the SpanMarker library has been integrated with the Hugging Face Hub and the Hugging Face Inference API. See the SpanMarker documentation on [Hugging Face](https://huggingface.co/docs/hub/span_marker), or browse [all SpanMarker models on the Hugging Face Hub](https://huggingface.co/models?library=span-marker).
Through the Inference API integration, users can test any SpanMarker model on the Hugging Face Hub for free using a widget on its [model page](https://huggingface.co/tomaarsen/span-marker-bert-base-fewnerd-fine-super). Furthermore, each public SpanMarker model offers a free API for fast prototyping and can be deployed to production using Hugging Face Inference Endpoints.

| Inference API Widget (on a model page) | Free Inference API (`Deploy` > `Inference API` on a model page) |
| ------------- | ------------- |
|  ![image](https://github.com/tomaarsen/SpanMarkerNER/assets/37621491/234078b7-22c8-491c-8686-faccd394f683) |  ![image](https://github.com/tomaarsen/SpanMarkerNER/assets/37621491/410e5191-9354-4e27-b718-2d69af678eb7) |
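
As a rough sketch of the free Inference API route, the standard serverless endpoint can be queried with a plain HTTP request. The URL pattern and payload follow general Hugging Face Inference API conventions; the exact response shape for SpanMarker models is not specified here, so treat this as a starting point:

```python
import requests

# Replace hf_... with a (hypothetical) access token from https://huggingface.co/settings/tokens.
API_URL = "https://api-inference.huggingface.co/models/tomaarsen/span-marker-bert-base-fewnerd-fine-super"
headers = {"Authorization": "Bearer hf_..."}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."},
)
print(response.json())  # Expected: predicted entities with labels and confidence scores.
```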

## Documentation
Feel free to have a look at the [documentation](https://tomaarsen.github.io/SpanMarkerNER).

## Installation
You may install the [`span_marker`](https://pypi.org/project/span-marker) Python module via `pip` like so:
```
pip install span_marker
```
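
To verify the installation, import the module; the `__version__` attribute is assumed to be exposed here, as is conventional for Python packages:

```python
import span_marker

print(span_marker.__version__)  # e.g. "1.7.0" (assumes the package exposes __version__)
```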

## Quick Start
### Training
Please have a look at our [Getting Started](notebooks/getting_started.ipynb) notebook for details on how SpanMarker is commonly used; it explains the following snippet in more detail. Alternatively, check out the [training scripts](training_scripts) that have been used successfully in the past.

| Colab                                                                                                                                                                                                         | Kaggle                                                                                                                                                                                                             | Gradient                                                                                                                                                                                         | Studio Lab                                                                                                                                                                                                             |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       | [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/tomaarsen/SpanMarkerNER/blob/main/notebooks/getting_started.ipynb)                       |

```python
from pathlib import Path
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData


def main() -> None:
    # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
    dataset_id = "DFKI-SLT/few-nerd"
    dataset_name = "FewNERD"
    dataset = load_dataset(dataset_id, "supervised")
    dataset = dataset.remove_columns("ner_tags")
    dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
    labels = dataset["train"].features["ner_tags"].feature.names
    # ['O', 'art-broadcastprogram', 'art-film', 'art-music', 'art-other', ...

    # Initialize a SpanMarker model using a pretrained BERT-style encoder
    encoder_id = "bert-base-cased"
    model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"
    model = SpanMarkerModel.from_pretrained(
        encoder_id,
        labels=labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=8,
        # Model card arguments
        model_card_data=SpanMarkerModelCardData(
            model_id=model_id,
            encoder_id=encoder_id,
            dataset_name=dataset_name,
            dataset_id=dataset_id,
            license="cc-by-sa-4.0",
            language="en",
        ),
    )

    # Prepare the 🤗 transformers training arguments
    output_dir = Path("models") / model_id
    args = TrainingArguments(
        output_dir=output_dir,
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=3000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )

    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()

    # Compute & save the metrics on the test set
    metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
    trainer.save_metrics("test", metrics)

    # Save the final checkpoint
    trainer.save_model(output_dir / "checkpoint-final")

if __name__ == "__main__":
    main()
```
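
After training, the final checkpoint can be loaded back with the same `from_pretrained` call used for Hub models, but pointed at the local path; the path below simply mirrors the `output_dir` from the script above:

```python
from span_marker import SpanMarkerModel

# Load the final checkpoint saved by trainer.save_model(...) in the script above.
model = SpanMarkerModel.from_pretrained(
    "models/tomaarsen/span-marker-bert-base-cased-fewnerd-fine-super/checkpoint-final"
)
entities = model.predict("Leonardo da Vinci painted the Mona Lisa in Florence.")
```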

### Inference
```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
# [{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7659597396850586, 'char_start_index': 0, 'char_end_index': 14},
#  {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9725785851478577, 'char_start_index': 38, 'char_end_index': 54},
#  {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7587679028511047, 'char_start_index': 66, 'char_end_index': 74},
#  {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
```
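
`model.predict` also accepts a list of sentences, and the model itself is a regular PyTorch module, so it can be moved to a GPU. A minimal sketch follows; the `batch_size` keyword is an assumption based on the documented `predict` signature:

```python
import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
# SpanMarkerModel is a torch.nn.Module, so the usual device handling applies.
if torch.cuda.is_available():
    model = model.cuda()

sentences = [
    "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.",
    "The Wright Flyer made its first powered flight at Kitty Hawk in 1903.",
]
# One list of entity dicts is returned per input sentence.
all_entities = model.predict(sentences, batch_size=8)
```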

## Pretrained Models

Each model in this list includes a `train.py` file showing the training script used to generate it; all of these training scripts are also collected in the [training_scripts](training_scripts) directory.
These trained models have Hosted Inference API widgets that you can use to experiment with them on their Hugging Face model pages. Additionally, Hugging Face provides each model with a free API (`Deploy` > `Inference API` on the model page).

These models are further elaborated on in my [thesis](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf).

### FewNERD
* [`tomaarsen/span-marker-bert-base-fewnerd-fine-super`](https://huggingface.co/tomaarsen/span-marker-bert-base-fewnerd-fine-super) is a model that I trained in 2 hours on the fine-grained, supervised [Few-NERD dataset](https://huggingface.co/datasets/DFKI-SLT/few-nerd). It reached 70.53 Test F1, which is competitive on the all-time [Few-NERD leaderboard](https://paperswithcode.com/sota/named-entity-recognition-on-few-nerd-sup) for models using `bert-base`. My training script resembles the one shown above.

* [`tomaarsen/span-marker-roberta-large-fewnerd-fine-super`](https://huggingface.co/tomaarsen/span-marker-roberta-large-fewnerd-fine-super) was trained in 6 hours on the fine-grained, supervised [Few-NERD dataset](https://huggingface.co/datasets/DFKI-SLT/few-nerd) using `roberta-large`. It reached 71.03 Test F1, setting a new state of the art on the all-time [Few-NERD leaderboard](https://paperswithcode.com/sota/named-entity-recognition-on-few-nerd-sup).
* [`tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super) is a multilingual model that I trained in 1.5 hours on the fine-grained, supervised [Few-NERD dataset](https://huggingface.co/datasets/DFKI-SLT/few-nerd). It reached 68.6 Test F1 on English and also works well on other languages, such as Spanish, French, German, Russian, Dutch, Polish, Icelandic, Greek and many more.

### OntoNotes v5.0
* [`tomaarsen/span-marker-roberta-large-ontonotes5`](https://huggingface.co/tomaarsen/span-marker-roberta-large-ontonotes5) was trained in 3 hours on the OntoNotes v5.0 dataset, reaching a performance of 91.54 F1. For reference, the current strongest spaCy model (`en_core_web_trf`) reaches 89.8 F1. This SpanMarker model uses a `roberta-large` encoder under the hood.

### CoNLL03
* [`tomaarsen/span-marker-xlm-roberta-large-conll03`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-conll03) is a SpanMarker model using `xlm-roberta-large` that was trained in 45 minutes. It reaches a state-of-the-art 93.1 F1 on CoNLL03 without using document-level context. For reference, the current strongest spaCy model (`en_core_web_trf`) reaches 91.6 F1.
* [`tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context) is another SpanMarker model using the `xlm-roberta-large` encoder. It uses [document-level context](https://tomaarsen.github.io/SpanMarkerNER/notebooks/document_level_context.html) to reach a state-of-the-art 94.4 F1. For the best performance, inference should also be performed with document-level context ([docs](https://tomaarsen.github.io/SpanMarkerNER/notebooks/document_level_context.html#Inference)); a sketch follows below. This model was trained in 1 hour.
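
Per the linked docs, document-level context at inference time is provided by passing a 🤗 `Dataset` whose sentences carry `document_id` and `sentence_id` columns alongside `tokens`. The column names and the `Dataset`-based `predict` call here are assumptions drawn from that documentation, so verify them there:

```python
from datasets import Dataset
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context")

# Assumed format: sentences sharing a document_id belong to the same document,
# and sentence_id orders them within that document.
dataset = Dataset.from_dict({
    "tokens": [
        ["Tottenham", "dominated", "the", "first", "half", "."],
        ["They", "scored", "twice", "before", "the", "break", "."],
    ],
    "document_id": [0, 0],
    "sentence_id": [0, 1],
})
entities = model.predict(dataset)
```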

### CoNLL++
* [`tomaarsen/span-marker-xlm-roberta-large-conllpp-doc-context`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-conllpp-doc-context) was trained in an hour using the `xlm-roberta-large` encoder on the CoNLL++ dataset. Using [document-level context](https://tomaarsen.github.io/SpanMarkerNER/notebooks/document_level_context.html), it reaches a very competitive 95.5 F1. For the best performance, inference should be performed using document-level context ([docs](https://tomaarsen.github.io/SpanMarkerNER/notebooks/document_level_context.html#Inference)).

### MultiNERD
* [`tomaarsen/span-marker-xlm-roberta-base-multinerd`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-multinerd) is a multilingual SpanMarker model using the `xlm-roberta-base` encoder, trained on the huge [MultiNERD](https://huggingface.co/datasets/Babelscape/multinerd) dataset. It reaches 91.31 F1 across all 10 training languages and 94.55 F1 on English alone. The model distinguishes between 15 classes. For best performance, separate punctuation from your words as described [here](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-multinerd#limitations). Note that [`tomaarsen/span-marker-mbert-base-multinerd`](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd) does not have this limitation and performs better.

* [`tomaarsen/span-marker-mbert-base-multinerd`](https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd) is the successor of [`tomaarsen/span-marker-xlm-roberta-base-multinerd`](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-multinerd): a multilingual SpanMarker model using `bert-base-multilingual-cased`, trained on the [MultiNERD](https://huggingface.co/datasets/Babelscape/multinerd) dataset. It reaches a state-of-the-art 92.48 F1 across all 10 training languages and 95.18 F1 on English alone. This model generalizes well to languages using the Latin and Cyrillic scripts.

## Using pretrained SpanMarker models with spaCy
All [SpanMarker models on the Hugging Face Hub](https://huggingface.co/models?library=span-marker) can also be used in spaCy: adding the `span_marker` pipeline component takes a single line. See the [Documentation](https://tomaarsen.github.io/SpanMarkerNER/notebooks/spacy_integration.html) or [API Reference](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.spacy_integration.html) for more information.
```python
import spacy

# Load the spaCy model with the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Cleopatra VII, "PERSON"), (Cleopatra the Great, "PERSON"), (the Ptolemaic Kingdom of Egypt, "GPE"),
(69 BCE, "DATE"), (Egypt, "GPE"), (51 BCE, "DATE"), (30 BCE, "DATE")]
"""
```
![image](https://user-images.githubusercontent.com/37621491/246170623-6351cb7e-bbb0-4472-af16-9a351a253da9.png)
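
For many texts, spaCy's standard `nlp.pipe` batching works unchanged with the `span_marker` component; this sketch uses plain spaCy APIs rather than anything SpanMarker-specific:

```python
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

texts = [
    "Cleopatra VII was the last active ruler of the Ptolemaic Kingdom of Egypt.",
    "Julius Caesar was assassinated on the Ides of March in 44 BCE.",
]
# nlp.pipe streams texts through the pipeline in batches.
for doc in nlp.pipe(texts, batch_size=16):
    print([(ent.text, ent.label_) for ent in doc.ents])
```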

## Context
<h1 align="center">
    <a href="https://github.com/argilla-io/argilla">
    <img src="https://github.com/dvsrepo/imgs/raw/main/rg.svg" alt="Argilla" width="150">
    </a>
</h1>

I developed this library as part of my thesis work at [Argilla](https://github.com/argilla-io/argilla). Feel free to read my finished [thesis](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf), which is included in this repository!

## Changelog
See [CHANGELOG.md](CHANGELOG.md) for news on all SpanMarker versions.

## License
See [LICENSE](LICENSE) for the current license.

            
