nerpy

Name	nerpy
Version	0.1.2
home_page	https://github.com/shibing624/nerpy
Summary	nerpy: Named Entity Recognition toolkit using Python
upload_time	2024-01-24 13:27:58
maintainer
docs_url	None
author	XuMing
requires_python	>=3.8
license	Apache License 2.0
keywords	ner nerpy chinese named entity recognition tool ner bert bert2tag
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

[![PyPI version](https://badge.fury.io/py/nerpy.svg)](https://badge.fury.io/py/nerpy)
[![Downloads](https://pepy.tech/badge/nerpy)](https://pepy.tech/project/nerpy)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![GitHub contributors](https://img.shields.io/github/contributors/shibing624/nerpy.svg)](https://github.com/shibing624/nerpy/graphs/contributors)
[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![python_version](https://img.shields.io/badge/Python-3.8%2B-green.svg)](requirements.txt)
[![GitHub issues](https://img.shields.io/github/issues/shibing624/nerpy.svg)](https://github.com/shibing624/nerpy/issues)
[![Wechat Group](http://vlog.sfyc.ltd/wechat_everyday/wxgroup_logo.png?imageView2/0/w/60/h/20)](#Contact)

# NERpy
🌈 Implementation of Named Entity Recognition using Python. 

**nerpy** implements several named entity recognition models, including BertSoftmax, BertCrf, and BertSpan, and compares their performance on standard datasets.


**Guide**
- [Feature](#Feature)
- [Evaluation](#Evaluation)
- [Install](#install)
- [Usage](#usage)
- [Contact](#Contact)
- [Reference](#reference)


# Feature
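### Named entity recognition models
- [BertSoftmax](nerpy/ner_model.py): BertSoftmax performs entity recognition on top of a pretrained BERT model. This project implements training and prediction for BertSoftmax in PyTorch.
- [BertSpan](nerpy/bertspan.py): BertSpan trains span-boundary representations on top of BERT, a structure better suited to recognizing entity boundaries. This project implements training and prediction for BertSpan in PyTorch.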
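
To make the span-boundary idea concrete, here is a minimal sketch of one common way to decode span-style predictions: each token receives a start label and an end label, and every start is paired with the nearest following end of the same type. The function name `decode_spans`, the label ids, and the convention that `0` means "no boundary" are assumptions made for illustration; the decoding in `nerpy/bertspan.py` may differ.

```python
from typing import Dict, List, Tuple


def decode_spans(
    start_labels: List[int],
    end_labels: List[int],
    id2label: Dict[int, str],
) -> List[Tuple[str, int, int]]:
    """Pair each predicted start with the nearest end of the same type.

    Returns (entity_type, start_index, end_index) with an inclusive end index.
    """
    spans = []
    for i, start in enumerate(start_labels):
        if start == 0:  # assumed convention: 0 = no entity starts here
            continue
        # Scan forward (including position i, for single-token entities).
        for j in range(i, len(end_labels)):
            if end_labels[j] == start:
                spans.append((id2label[start], i, j))
                break
    return spans


# Hypothetical per-token predictions for a 7-token sentence.
id2label = {1: "PER", 2: "LOC"}
start = [1, 0, 0, 0, 0, 2, 0]
end = [0, 0, 1, 0, 0, 0, 2]
spans = decode_spans(start, end, id2label)
print(spans)
```

Because starts and ends are predicted separately, a start without a matching end simply produces no entity in this sketch.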

# Evaluation

### Entity recognition

- Evaluation results on the English entity recognition dataset:

| Arch | Backbone | Model Name | CoNLL-2003 | QPS |
| :-- | :--- | :--- | :-: | :--: |
| BertSoftmax | bert-base-uncased | bert4ner-base-uncased | 90.43 | 235 |
| BertSoftmax | bert-base-cased | bert4ner-base-cased | 91.17 | 235 |
| BertSpan | bert-base-uncased | bertspan4ner-base-uncased | 90.61 | 210 |
| BertSpan | bert-base-cased | bertspan4ner-base-cased | 91.90 | 224 |

- Evaluation results on the Chinese entity recognition datasets:

| Arch | Backbone | Model Name | CNER | PEOPLE | MSRA-NER | QPS |
| :-- | :--- | :--- | :-: | :-: | :-: | :-: |
| BertSoftmax | bert-base-chinese | bert4ner-base-chinese | 94.98 | 95.25 | 94.65 | 222 |
| BertSpan | bert-base-chinese | bertspan4ner-base-chinese | 96.03 | 96.06 | 95.03 | 254 |

- Evaluation results of the models released by this project:

| Arch | Backbone | Model Name | CNER(zh) | PEOPLE(zh) | CoNLL-2003(en) | QPS |
| :-- | :--- | :---- | :-: | :-: | :-: | :-: |
| BertSpan | bert-base-chinese | shibing624/bertspan4ner-base-chinese | 96.03 | 96.06 | - | 254 |
| BertSoftmax | bert-base-chinese | shibing624/bert4ner-base-chinese | 94.98 | 95.25 | - | 222 |
| BertSoftmax | bert-base-uncased | shibing624/bert4ner-base-uncased | - | - | 90.43 | 243 |

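Notes:
- All scores are F1.
- Each score comes from training only on that dataset's train split and evaluating on its test split; no external data is used.
- The `shibing624/bertspan4ner-base-chinese` model reaches SOTA among base-size models and was trained with the BertSpan method;
 run [examples/training_bertspan_zh_demo.py](examples/training_bertspan_zh_demo.py) to reproduce the results on each Chinese dataset.
- The `shibing624/bert4ner-base-chinese` model achieves good results among base-size models and was trained with the BertSoftmax method;
 run [examples/training_ner_model_zh_demo.py](examples/training_ner_model_zh_demo.py) to reproduce the results on each Chinese dataset.
- The `shibing624/bert4ner-base-uncased` model was trained with the BertSoftmax method;
 run [examples/training_ner_model_en_demo.py](examples/training_ner_model_en_demo.py) to reproduce the results on the English CoNLL-2003 dataset.
- Any of the pretrained backbones can be loaded through transformers, for example the Chinese BERT model: `--model_name bert-base-chinese`
- Download links for the Chinese entity recognition datasets are [listed below](#数据集).
- QPS was measured on a Tesla V100 GPU with 32 GB of memory.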
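
Since every score in the tables is an F1 value, a small worked example of entity-level F1 may help when reading them. This is a sketch under the usual definition (an entity counts as correct only if both its type and its span match exactly); the helper name `entity_f1` and the sample spans are made up for illustration, and whether the project's evaluation scripts compute the metric in exactly this way is an assumption.

```python
from typing import Iterable, Tuple

Entity = Tuple[str, int, int]  # (type, start, end)


def entity_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) over exact-match entities."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entities for one sentence.
gold = [("PER", 0, 2), ("LOC", 5, 6), ("LOC", 16, 18)]
pred = [("PER", 0, 2), ("LOC", 5, 6), ("ORG", 16, 18)]  # last type is wrong
precision, recall, f1 = entity_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Here two of three predictions match two of three gold entities, so precision, recall, and F1 are all 2/3.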

# Demo

Demo: https://huggingface.co/spaces/shibing624/nerpy

![](docs/hf.png)

Run [examples/gradio_demo.py](examples/gradio_demo.py) to launch the demo locally:
```shell
python examples/gradio_demo.py
```

 
# Install
Requires Python 3.8+.

```shell
pip install torch # conda install pytorch
pip install -U nerpy
```

Or install from source:

```shell
# Clone first: requirements.txt lives inside the repository
git clone https://github.com/shibing624/nerpy.git
cd nerpy

pip install torch # conda install pytorch
pip install -r requirements.txt
pip install --no-deps .
```

# Usage

## Named entity recognition

#### English entity recognition:

```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities:  [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]
```

#### Chinese entity recognition:

```python
>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-chinese")
>>> predictions, raw_outputs, entities = model.predict(["常建良，男，1963年出生，工科学士，高级工程师"], split_on_space=False)
entities: [('常建良', 'PER'), ('1963年', 'TIME')]
```

example: [examples/base_zh_demo.py](examples/base_zh_demo.py)

```python
import sys

sys.path.append('..')
from nerpy import NERModel

if __name__ == '__main__':
    # BertSoftmax Chinese NER model: NERModel("bert", "shibing624/bert4ner-base-chinese")
    # BertSpan Chinese NER model: NERModel("bertspan", "shibing624/bertspan4ner-base-chinese")
    model = NERModel("bert", "shibing624/bert4ner-base-chinese")
    sentences = [
        "常建良，男，1963年出生，工科学士，高级工程师，北京物资学院客座副教授",
        "1985年8月-1993年在国家物资局、物资部、国内贸易部金属材料流通司从事国家统配钢材中特种钢材品种的调拨分配工作，先后任科员、主任科员。"
    ]
    predictions, raw_outputs, entities = model.predict(sentences)
    print(entities)
```

output:
```
[('常建良', 'PER'), ('1963年', 'TIME'), ('北京物资学院', 'ORG')]
[('1985年', 'TIME'), ('8月', 'TIME'), ('1993年', 'TIME'), ('国家物资局', 'ORG'), ('物资部', 'ORG'), ('国内贸易部金属材料流通司', 'ORG')]
```

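- The `shibing624/bert4ner-base-chinese` model was trained with the BertSoftmax method on the Chinese PEOPLE (People's Daily) dataset and has been uploaded to the
HuggingFace model hub at [shibing624/bert4ner-base-chinese](https://huggingface.co/shibing624/bert4ner-base-chinese).
It is the default model for `nerpy.NERModel`. You can call it as in the example above, or through the [transformers library](https://github.com/huggingface/transformers) as shown below.
The model is downloaded automatically to the local path `~/.cache/huggingface/transformers`.
- The `shibing624/bertspan4ner-base-chinese` model was trained with the BertSpan method on the Chinese PEOPLE (People's Daily) dataset and has been uploaded to the
HuggingFace model hub at [shibing624/bertspan4ner-base-chinese](https://huggingface.co/shibing624/bertspan4ner-base-chinese).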


#### Usage (HuggingFace Transformers)
Without [nerpy](https://github.com/shibing624/nerpy), you can use the model like this:

First, pass the input through the transformer model; then decode the predicted BIO tags to recover the entity words.

example: [examples/predict_use_origin_transformers_zh_demo.py](examples/predict_use_origin_transformers_zh_demo.py)

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京，是个警察，喜欢去王府井游玩儿。"


def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
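    # encode() adds [CLS] and [SEP], so drop their predictions before pairing with tokens
    token_predictions = predictions[0].numpy()[1:-1]
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, token_predictions)]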
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
```
output:
```shell
王宏伟来自北京，是个警察，喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), ('，', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), ('，', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
```
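
The BIO decoding step in the example above is delegated to `seqeval`. As a rough illustration of what that step does, here is a dependency-free sketch that groups `B-`/`I-` tags into entity strings. The helper name `bio_to_entities` is hypothetical, and its handling of a stray `I-` tag (it is ignored) is a simplification that may not match `seqeval` in every edge case.

```python
from typing import List, Tuple


def bio_to_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group characters tagged B-X, I-X, ... into (text, X) pairs."""
    entities = []
    current_chars: List[str] = []
    current_type = None

    def flush():
        # Emit the entity collected so far, if any.
        if current_chars:
            entities.append(("".join(current_chars), current_type))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
            current_chars, current_type = [], None
    flush()
    return entities


chars = list("王宏伟来自北京")
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
entities = bio_to_entities(chars, tags)
print(entities)
```

Working on characters rather than string offsets avoids the assumption that each token is exactly one character of the input sentence.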

<a id="数据集"></a>
### Datasets

#### Entity recognition datasets


| Dataset | Corpus | Download link | File size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER Chinese NER dataset`** | CNER (120K characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily dataset (2M characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |
| **`MSRA-NER Chinese NER dataset`** | MSRA-NER dataset (46K sentences, 2.216M characters) | [MSRA-NER github](https://github.com/shibing624/nerpy/releases/download/0.1.0/msra_ner.tar.gz)| 3.6MB |
| **`CoNLL03 English NER dataset`** | CoNLL-2003 dataset (220K words) | [CoNLL03 github](https://github.com/shibing624/nerpy/tree/main/examples/data/conll03)| 1.7MB |


Data format of the CNER Chinese NER dataset (one character and its tag per line, separated by a tab; a blank line separates sentences):

```text
美	B-LOC
国	I-LOC
的	O
华	B-PER
莱	I-PER
士	I-PER

我	O
跟	O
他	O
```
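
As a rough sketch of how a file in this format could be turned into training rows, the snippet below parses tab-separated lines into `[sentence_id, word, label]` triples, the same three columns (`sentence_id`, `words`, `labels`) that the toy training example in this README passes to `pd.DataFrame`. The function name `parse_bio_text` is hypothetical, and nerpy's own data loaders may read these files differently.

```python
from typing import List


def parse_bio_text(text: str) -> List[list]:
    """Parse 'char<TAB>label' lines; blank lines start a new sentence."""
    rows = []
    sentence_id = 0
    sentence_has_tokens = False
    for line in text.splitlines():
        if not line.strip():
            # Only advance the id if the previous sentence had content,
            # so consecutive blank lines do not create empty sentences.
            if sentence_has_tokens:
                sentence_id += 1
                sentence_has_tokens = False
            continue
        char, label = line.split("\t")
        rows.append([sentence_id, char, label])
        sentence_has_tokens = True
    return rows


sample = "美\tB-LOC\n国\tI-LOC\n的\tO\n\n我\tO\n跟\tO\n"
rows = parse_bio_text(sample)
for row in rows:
    print(row)
```

The resulting list can be handed to `pd.DataFrame(rows, columns=["sentence_id", "words", "labels"])` if pandas is available.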


## BertSoftmax model

The BertSoftmax entity recognition model follows the standard BERT-based sequence labeling approach:

Network structure:


<img src="docs/bert.png" width="500" />


Model files:
```
shibing624/bert4ner-base-chinese
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt
```

#### Training and prediction with BertSoftmax

training example: [examples/training_ner_model_toy_demo.py](examples/training_ner_model_toy_demo.py)


```python
import sys
import pandas as pd

sys.path.append('..')
from nerpy.ner_model import NERModel


# Creating samples
train_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "started", "O"],
    [0, "with", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "can", "O"],
    [1, "now", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
train_data = pd.DataFrame(train_samples, columns=["sentence_id", "words", "labels"])

test_samples = [
    [0, "HuggingFace", "B-MISC"],
    [0, "Transformers", "I-MISC"],
    [0, "was", "O"],
    [0, "built", "O"],
    [0, "for", "O"],
    [0, "text", "O"],
    [0, "classification", "B-MISC"],
    [1, "Nerpy", "B-MISC"],
    [1, "Model", "I-MISC"],
    [1, "then", "O"],
    [1, "expanded", "O"],
    [1, "to", "O"],
    [1, "perform", "O"],
    [1, "NER", "B-MISC"],
]
test_data = pd.DataFrame(test_samples, columns=["sentence_id", "words", "labels"])

# Create a NERModel
model = NERModel(
    "bert",
    "bert-base-uncased",
    args={"overwrite_output_dir": True, "reprocess_input_data": True, "num_train_epochs": 1},
    use_cuda=False,
)

# Train the model
model.train_model(train_data)

# Evaluate the model
result, model_outputs, predictions = model.eval_model(test_data)
print(result, model_outputs, predictions)

# Predictions on text strings
sentences = ["Nerpy Model perform sentence NER", "HuggingFace Transformers build for text"]
predictions, raw_outputs, entities = model.predict(sentences, split_on_space=True)
print(predictions, entities)
```

- Train and evaluate the `BertSoftmax` model on the Chinese CNER dataset

example: [examples/training_ner_model_zh_demo.py](examples/training_ner_model_zh_demo.py)

```shell
cd examples
python training_ner_model_zh_demo.py --do_train --do_predict --num_epochs 5 --task_name cner
```
- Train and evaluate the `BertSoftmax` model on the English CoNLL-2003 dataset

example: [examples/training_ner_model_en_demo.py](examples/training_ner_model_en_demo.py)

```shell
cd examples
python training_ner_model_en_demo.py --do_train --do_predict --num_epochs 5
```


#### Training and prediction with BertSpan

- Train and evaluate the `BertSpan` model on the Chinese CNER dataset

example: [examples/training_bertspan_zh_demo.py](examples/training_bertspan_zh_demo.py)

```shell
cd examples
python training_bertspan_zh_demo.py --do_train --do_predict --num_epochs 5 --task_name cner
```

# Contact

- Issues (suggestions): [![GitHub issues](https://img.shields.io/github/issues/shibing624/nerpy.svg)](https://github.com/shibing624/nerpy/issues)
- Email: xuming (xuming624@qq.com)
- WeChat:
add my *WeChat ID: xuming624, with the note: name-company-NLP* to join the NLP discussion group.

<img src="docs/wechat.jpeg" width="200" />


# Citation

If you use nerpy in your research, please cite it in the following format:

APA:
```latex
Xu, M. nerpy: Named Entity Recognition Toolkit (Version 0.0.2) [Computer software]. https://github.com/shibing624/nerpy
```

BibTeX:
```latex
@software{Xu_nerpy_Text_to,
author = {Xu, Ming},
title = {{nerpy: Named Entity Recognition Toolkit}},
url = {https://github.com/shibing624/nerpy},
version = {0.0.2}
}
```

# License


nerpy is licensed under [The Apache License 2.0](LICENSE) and is free for commercial use. Please include a link to nerpy and the license in your product documentation.


# Contribute
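The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

 - Add corresponding unit tests under `tests`
 - Run all unit tests with `python -m pytest` and make sure every test passes

After that, you can submit a PR.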

# Reference
- [huggingface/transformers](https://github.com/huggingface/transformers)
