ckip-transformers

Name: ckip-transformers
Version: 0.3.4
Home page: https://ckip-transformers.readthedocs.io
Summary: CKIP Transformers
Author: Mu Yang
Requires Python: >=3.6
License: GPL-3.0
Upload time: 2023-04-09 13:03:05
CKIP Transformers
-----------------

| This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, named entity recognition).
| 這個專案提供了繁體中文的 transformers 模型(包含 ALBERT、BERT、GPT2)及自然語言處理工具(包含斷詞、詞性標記、實體辨識)。

Git
^^^

| https://github.com/ckiplab/ckip-transformers
| |GitHub Version| |GitHub License| |GitHub Release| |GitHub Issues|

.. |GitHub Version| image:: https://img.shields.io/github/v/release/ckiplab/ckip-transformers.svg?cacheSeconds=3600
   :target: https://github.com/ckiplab/ckip-transformers/releases

.. |GitHub License| image:: https://img.shields.io/github/license/ckiplab/ckip-transformers.svg?cacheSeconds=3600
   :target: https://github.com/ckiplab/ckip-transformers/blob/master/LICENSE

.. |GitHub Release| image:: https://img.shields.io/github/release-date/ckiplab/ckip-transformers.svg?cacheSeconds=3600

.. |GitHub Downloads| image:: https://img.shields.io/github/downloads/ckiplab/ckip-transformers/total.svg?cacheSeconds=3600
   :target: https://github.com/ckiplab/ckip-transformers/releases/latest

.. |GitHub Issues| image:: https://img.shields.io/github/issues/ckiplab/ckip-transformers.svg?cacheSeconds=3600
   :target: https://github.com/ckiplab/ckip-transformers/issues

.. |GitHub Forks| image:: https://img.shields.io/github/forks/ckiplab/ckip-transformers.svg?style=social&label=Fork&cacheSeconds=3600

.. |GitHub Stars| image:: https://img.shields.io/github/stars/ckiplab/ckip-transformers.svg?style=social&label=Star&cacheSeconds=3600

.. |GitHub Watchers| image:: https://img.shields.io/github/watchers/ckiplab/ckip-transformers.svg?style=social&label=Watch&cacheSeconds=3600

PyPI
^^^^

| https://pypi.org/project/ckip-transformers
| |PyPI Version| |PyPI License| |PyPI Downloads| |PyPI Python| |PyPI Implementation| |PyPI Format| |PyPI Status|

.. |PyPI Version| image:: https://img.shields.io/pypi/v/ckip-transformers.svg?cacheSeconds=3600
   :target: https://pypi.org/project/ckip-transformers

.. |PyPI License| image:: https://img.shields.io/pypi/l/ckip-transformers.svg?cacheSeconds=3600
   :target: https://github.com/ckiplab/ckip-transformers/blob/master/LICENSE

.. |PyPI Downloads| image:: https://img.shields.io/pypi/dm/ckip-transformers.svg?cacheSeconds=3600
   :target: https://pypi.org/project/ckip-transformers#files

.. |PyPI Python| image:: https://img.shields.io/pypi/pyversions/ckip-transformers.svg?cacheSeconds=3600

.. |PyPI Implementation| image:: https://img.shields.io/pypi/implementation/ckip-transformers.svg?cacheSeconds=3600

.. |PyPI Format| image:: https://img.shields.io/pypi/format/ckip-transformers.svg?cacheSeconds=3600

.. |PyPI Status| image:: https://img.shields.io/pypi/status/ckip-transformers.svg?cacheSeconds=3600

Documentation
^^^^^^^^^^^^^

| https://ckip-transformers.readthedocs.io
| |ReadTheDocs Home|

.. |ReadTheDocs Home| image:: https://img.shields.io/website/https/ckip-transformers.readthedocs.io.svg?cacheSeconds=3600&up_message=online&down_message=offline
   :target: https://ckip-transformers.readthedocs.io

Demo
^^^^

| https://ckip.iis.sinica.edu.tw/service/transformers
| |Transformers Demo|

.. |Transformers Demo| image:: https://img.shields.io/website/https/ckip.iis.sinica.edu.tw/service/transformers.svg?cacheSeconds=3600&up_message=online&down_message=offline
   :target: https://ckip.iis.sinica.edu.tw/service/transformers

Contributors
^^^^^^^^^^^^

* `Mu Yang <https://muyang.pro>`__ at `CKIP <https://ckip.iis.sinica.edu.tw>`__ (Author & Maintainer).
* `Wei-Yun Ma <https://www.iis.sinica.edu.tw/pages/ma/>`__ at `CKIP <https://ckip.iis.sinica.edu.tw>`__ (Maintainer).

Related Packages
^^^^^^^^^^^^^^^^

- `CkipTagger <https://github.com/ckiplab/ckiptagger>`_: An alternative Chinese NLP library using BiLSTM models.
- `CKIP CoreNLP Toolkit <https://github.com/ckiplab/ckipnlp>`_: A Chinese NLP library with more NLP tasks and utilities.

Models
------

| You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.
| 您可於 https://huggingface.co/ckiplab/ 下載預訓練的模型。

- Language Models
   * `ALBERT Tiny <https://huggingface.co/ckiplab/albert-tiny-chinese>`_: ``ckiplab/albert-tiny-chinese``
   * `ALBERT Base <https://huggingface.co/ckiplab/albert-base-chinese>`_: ``ckiplab/albert-base-chinese``
   * `BERT Tiny <https://huggingface.co/ckiplab/bert-tiny-chinese>`_: ``ckiplab/bert-tiny-chinese``
   * `BERT Base <https://huggingface.co/ckiplab/bert-base-chinese>`_: ``ckiplab/bert-base-chinese``
   * `GPT2 Tiny <https://huggingface.co/ckiplab/gpt2-tiny-chinese>`_: ``ckiplab/gpt2-tiny-chinese``
   * `GPT2 Base <https://huggingface.co/ckiplab/gpt2-base-chinese>`_: ``ckiplab/gpt2-base-chinese``

- NLP Task Models
   * `ALBERT Tiny — Word Segmentation <https://huggingface.co/ckiplab/albert-tiny-chinese-ws>`_: ``ckiplab/albert-tiny-chinese-ws``
   * `ALBERT Tiny — Part-of-Speech Tagging <https://huggingface.co/ckiplab/albert-tiny-chinese-pos>`_: ``ckiplab/albert-tiny-chinese-pos``
   * `ALBERT Tiny — Named-Entity Recognition <https://huggingface.co/ckiplab/albert-tiny-chinese-ner>`_: ``ckiplab/albert-tiny-chinese-ner``
   * `ALBERT Base — Word Segmentation <https://huggingface.co/ckiplab/albert-base-chinese-ws>`_: ``ckiplab/albert-base-chinese-ws``
   * `ALBERT Base — Part-of-Speech Tagging <https://huggingface.co/ckiplab/albert-base-chinese-pos>`_: ``ckiplab/albert-base-chinese-pos``
   * `ALBERT Base — Named-Entity Recognition <https://huggingface.co/ckiplab/albert-base-chinese-ner>`_: ``ckiplab/albert-base-chinese-ner``
   * `BERT Tiny — Word Segmentation <https://huggingface.co/ckiplab/bert-tiny-chinese-ws>`_: ``ckiplab/bert-tiny-chinese-ws``
   * `BERT Tiny — Part-of-Speech Tagging <https://huggingface.co/ckiplab/bert-tiny-chinese-pos>`_: ``ckiplab/bert-tiny-chinese-pos``
   * `BERT Tiny — Named-Entity Recognition <https://huggingface.co/ckiplab/bert-tiny-chinese-ner>`_: ``ckiplab/bert-tiny-chinese-ner``
   * `BERT Base — Word Segmentation <https://huggingface.co/ckiplab/bert-base-chinese-ws>`_: ``ckiplab/bert-base-chinese-ws``
   * `BERT Base — Part-of-Speech Tagging <https://huggingface.co/ckiplab/bert-base-chinese-pos>`_: ``ckiplab/bert-base-chinese-pos``
   * `BERT Base — Named-Entity Recognition <https://huggingface.co/ckiplab/bert-base-chinese-ner>`_: ``ckiplab/bert-base-chinese-ner``

Model Usage
^^^^^^^^^^^

| You may use our models directly through HuggingFace's transformers library.
| 您可直接透過 HuggingFace's transformers 套件使用我們的模型。

.. code-block:: bash

   pip install -U transformers

| Please use ``BertTokenizerFast`` as the tokenizer, and replace ``ckiplab/albert-tiny-chinese`` and ``ckiplab/albert-tiny-chinese-ws`` in the following example with whichever model you need.
| 請使用內建的 BertTokenizerFast,並將以下範例中的 ``ckiplab/albert-tiny-chinese`` 與 ``ckiplab/albert-tiny-chinese-ws`` 替換成任何您要使用的模型名稱。

.. code-block:: python

   from transformers import (
      BertTokenizerFast,
      AutoModelForMaskedLM,
      AutoModelForCausalLM,
      AutoModelForTokenClassification,
   )

   # masked language model (ALBERT, BERT)
   tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
   model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above

   # causal language model (GPT2)
   tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
   model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

   # nlp task model
   tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
   model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above
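
For example, here is a minimal sketch of running the word segmentation model loaded above on a raw sentence. The sentence and decoding loop are illustrative only; the label names come from whatever the model's ``id2label`` mapping defines, and a recent version of transformers is assumed.

.. code-block:: python

   import torch

   tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
   model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')

   # Illustrative input sentence
   text = "中央研究院位於南港"

   # Tokenize and run the token-classification head
   inputs = tokenizer(text, return_tensors='pt')
   with torch.no_grad():
      logits = model(**inputs).logits

   # Map the predicted class ids back to label names via the model config
   pred_ids = logits.argmax(dim=-1)[0].tolist()
   tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
   print([(token, model.config.id2label[i]) for token, i in zip(tokens, pred_ids)])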

Model Fine-Tuning
^^^^^^^^^^^^^^^^^

| To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
| 您可參考以下的範例去微調我們的模型於您自己的資料集。

- https://github.com/huggingface/transformers/tree/master/examples
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification

| Remember to set ``--tokenizer_name bert-base-chinese`` in order to use the Chinese tokenizer.
| 記得設置 ``--tokenizer_name bert-base-chinese`` 以正確的使用中文的 tokenizer。

.. code-block:: bash

   # Replace the model name with any of the models listed above.
   python run_mlm.py \
      --model_name_or_path ckiplab/albert-tiny-chinese \
      --tokenizer_name bert-base-chinese \
      ...

   python run_ner.py \
      --model_name_or_path ckiplab/albert-tiny-chinese-ws \
      --tokenizer_name bert-base-chinese \
      ...

Model Performance
^^^^^^^^^^^^^^^^^

| The following is a performance comparison between our models and other models.
| The results are tested on a traditional Chinese corpus.
| 以下是我們的模型與其他的模型之性能比較。
| 各個任務皆測試於繁體中文的測試集。

================================  ===========  ===========  ========  ==========  =========
Model                             #Parameters  Perplexity†  WS (F1)‡  POS (ACC)‡  NER (F1)‡
================================  ===========  ===========  ========  ==========  =========
ckiplab/albert-tiny-chinese         4M          4.80        96.66%    94.48%      71.17%
ckiplab/albert-base-chinese        11M          2.65        97.33%    95.30%      79.47%
ckiplab/bert-tiny-chinese          12M          8.07        96.98%    95.11%      74.21%
ckiplab/bert-base-chinese         102M          1.88        97.60%    95.67%      81.18%
ckiplab/gpt2-tiny-chinese           4M         16.94        --        --          --
ckiplab/gpt2-base-chinese         102M          8.36        --        --          --
--------------------------------  -----------  -----------  --------  ----------  ---------
voidful/albert_chinese_tiny         4M         74.93        --        --          --
voidful/albert_chinese_base        11M         22.34        --        --          --
bert-base-chinese                 102M          2.53        --        --          --
================================  ===========  ===========  ========  ==========  =========

| † Perplexity; the smaller the better.
| † 混淆度;數字越小越好。
| ‡ WS: word segmentation; POS: part-of-speech; NER: named-entity recognition; the larger the better.
| ‡ WS: 斷詞;POS: 詞性標記;NER: 實體辨識;數字越大越好。
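
As a quick reference, perplexity is conventionally the exponential of the mean per-token cross-entropy loss, so converting between the two is straightforward; a minimal sketch (the loss value below is hypothetical, for illustration only):

.. code-block:: python

   import math

   # Perplexity = exp(mean cross-entropy per token); hypothetical loss value.
   mean_cross_entropy = 0.63
   perplexity = math.exp(mean_cross_entropy)  # ~ 1.88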

Training Corpus
^^^^^^^^^^^^^^^

| The language models are trained on the ZhWiki and CNA datasets; the WS and POS models are trained on the ASBC dataset; the NER models are trained on the OntoNotes dataset.
| 以上的語言模型訓練於 ZhWiki 與 CNA 資料集上;斷詞(WS)與詞性標記(POS)任務模型訓練於 ASBC 資料集上;實體辨識(NER)任務模型訓練於 OntoNotes 資料集上。

* ZhWiki: https://dumps.wikimedia.org/zhwiki/
   | Chinese Wikipedia text (20200801 dump), converted to Traditional Chinese using `OpenCC <https://github.com/BYVoid/OpenCC>`_.
   | 中文維基的文章(20200801 版本),利用 `OpenCC <https://github.com/BYVoid/OpenCC>`_ 翻譯成繁體中文。
* CNA: https://catalog.ldc.upenn.edu/LDC2011T13
   | Chinese Gigaword Fifth Edition — CNA (Central News Agency) part.
   | 中文 Gigaword 第五版 — CNA(中央社)的部分。
* ASBC: http://asbc.iis.sinica.edu.tw
   | Academia Sinica Balanced Corpus of Modern Chinese release 4.0.
   | 中央研究院漢語平衡語料庫第四版。
* OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19
   | OntoNotes release 5.0, Chinese part, converted to Traditional Chinese using `OpenCC <https://github.com/BYVoid/OpenCC>`_.
   | OntoNotes 第五版,中文部分,利用 `OpenCC <https://github.com/BYVoid/OpenCC>`_ 翻譯成繁體中文。

| Here is a summary of each corpus.
| 以下是各個資料集的一覽表。

================  ================  ================  ================  ================
Dataset           #Documents        #Lines            #Characters       Line Type
================  ================  ================  ================  ================
CNA               2,559,520         13,532,445        1,219,029,974     Paragraph
ZhWiki            1,106,783         5,918,975         495,446,829       Paragraph
ASBC              19,247            1,395,949         17,572,374        Clause
OntoNotes         1,911             48,067            1,568,491         Sentence
================  ================  ================  ================  ================

| Here is the dataset split used for language models.
| 以下是用於訓練語言模型的資料集切割。

================  ================  ================  ================
CNA+ZhWiki        #Documents        #Lines            #Characters
================  ================  ================  ================
Train             3,606,303         18,986,238        4,347,517,682
Dev               30,000            148,077           32,888,978
Test              30,000            151,241           35,216,818
================  ================  ================  ================

| Here is the dataset split used for word segmentation and part-of-speech tagging models.
| 以下是用於訓練斷詞及詞性標記模型的資料集切割。

================  ================  ================  ================  ================
ASBC              #Documents        #Lines            #Words            #Characters
================  ================  ================  ================  ================
Train             15,247            1,183,260         9,480,899         14,724,250
Dev               2,000             52,677            448,964           741,323
Test              2,000             160,012           1,315,129         2,106,799
================  ================  ================  ================  ================


| Here is the dataset split used for named entity recognition models.
| 以下是用於訓練實體辨識模型的資料集切割。

================  ================  ================  ================  ================
OntoNotes         #Documents        #Lines            #Characters       #Named-Entities
================  ================  ================  ================  ================
Train             1,511             43,362            1,367,658         68,947
Dev               200               2,304             93,535            7,186
Test              200               2,401             107,298           6,977
================  ================  ================  ================  ================

NLP Tools
---------

| The package also provides the following NLP tools.
| 我們的套件也提供了以下的自然語言處理工具。

* (WS) Word Segmentation 斷詞
* (POS) Part-of-Speech Tagging 詞性標記
* (NER) Named Entity Recognition 實體辨識

Installation
^^^^^^^^^^^^

``pip install -U ckip-transformers``

Requirements:

* `Python <https://www.python.org>`__ 3.6+
* `PyTorch <https://pytorch.org>`__ 1.5+
* `HuggingFace Transformers <https://huggingface.co/transformers/>`__ 3.5+

NLP Tools Usage
^^^^^^^^^^^^^^^

| See `here <../_api/ckip_transformers.html>`_ for API details.
| 詳細的 API 請參見 `此處 <../_api/ckip_transformers.html>`_ 。

| The complete script of this example is https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.
| 以下的範例的完整檔案可參見 https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py 。

1. Import module
""""""""""""""""

.. code-block:: python

   from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

2. Load models
""""""""""""""

| We provide several pretrained models for the NLP tools.
| 我們提供了一些適用於自然語言工具的預訓練的模型。

.. code-block:: python

   # Initialize drivers
   ws_driver  = CkipWordSegmenter(model="bert-base")
   pos_driver = CkipPosTagger(model="bert-base")
   ner_driver = CkipNerChunker(model="bert-base")

| You may also load your own checkpoints with our drivers.
| 也可以運用我們的工具於自己訓練的模型上。

.. code-block:: python

   # Initialize drivers with custom checkpoints
   ws_driver  = CkipWordSegmenter(model_name="path_to_your_model")
   pos_driver = CkipPosTagger(model_name="path_to_your_model")
   ner_driver = CkipNerChunker(model_name="path_to_your_model")

| To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (the default) to disable GPU usage.
| 可於宣告斷詞等工具時指定 device 以使用 GPU,設為 -1 (預設值)代表不使用 GPU。

.. code-block:: python

   # Use CPU
   ws_driver = CkipWordSegmenter(device=-1)

   # Use GPU:0
   ws_driver = CkipWordSegmenter(device=0)

3. Run pipeline
"""""""""""""""

| The input for word segmentation and named-entity recognition must be a list of sentences.
| The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
| 斷詞與實體辨識的輸入必須是 list of sentences。
| 詞性標記的輸入必須是 list of list of words。

.. code-block:: python

   # Input text
   text = [
      "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
      "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
      "空白 也是可以的~",
   ]

   # Run pipeline
   ws  = ws_driver(text)
   pos = pos_driver(ws)
   ner = ner_driver(text)

| While running the model, the POS driver automatically segments the input internally on these characters ``'，,。：:；;！!？?'``. (The output sentences are concatenated back together afterwards.) You may set ``delim_set`` to any characters you want.
| You may set ``use_delim=False`` to disable this feature, or set ``use_delim=True`` in the WS and NER drivers to enable it.
| 詞性標記工具會自動用 ``',,。::;;!!??'`` 等字元在執行模型前切割句子(輸出的句子會自動接回)。可設定 ``delim_set`` 參數使用別的字元做切割。
| 另外可指定 ``use_delim=False`` 以停用此功能，或於斷詞、實體辨識時指定 ``use_delim=True`` 以啟用此功能。

.. code-block:: python

   # Enable sentence segmentation
   ws  = ws_driver(text, use_delim=True)
   ner = ner_driver(text, use_delim=True)

   # Disable sentence segmentation
   pos = pos_driver(ws, use_delim=False)

   # Use new line characters and tabs for sentence segmentation
   pos = pos_driver(ws, delim_set='\n\t')

| You may specify ``batch_size`` and ``max_length`` to better utilize your machine resources.
| 您亦可設置 ``batch_size`` 與 ``max_length`` 以更完美的利用您的機器資源。

.. code-block:: python

   # Sets the batch size and maximum sentence length
   ws = ws_driver(text, batch_size=256, max_length=128)

4. Show results
"""""""""""""""

.. code-block:: python

   # Pack word segmentation and part-of-speech results
   def pack_ws_pos_sentence(sentence_ws, sentence_pos):
      assert len(sentence_ws) == len(sentence_pos)
      res = []
      for word_ws, word_pos in zip(sentence_ws, sentence_pos):
         res.append(f"{word_ws}({word_pos})")
      return "\u3000".join(res)

   # Show results
   for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
      print(sentence)
      print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
      for entity in sentence_ner:
         print(entity)
      print()

.. code-block:: text

   傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
   傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
   NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
   NerToken(word='20年', ner='DATE', idx=(18, 21))
   NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

   美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
   美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
   NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
   NerToken(word='今天', ner='LOC', idx=(7, 9))
   NerToken(word='布什', ner='PERSON', idx=(11, 13))
   NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
   NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
   NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
   NerToken(word='參議院', ner='ORG', idx=(42, 45))
   NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
   NerToken(word='華裔', ner='NORP', idx=(60, 62))

   空白 也是可以的~
   空白(VH)  (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
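
Each ``NerToken`` above carries the entity string, its type, and what appear to be character offsets into the original sentence. Below is a minimal sketch of consuming the results, assuming ``idx`` is a ``(start, end)`` pair of character positions:

.. code-block:: python

   # Recover each entity span directly from its source sentence
   for sentence, sentence_ner in zip(text, ner):
      for entity in sentence_ner:
         start, end = entity.idx  # assumed (start, end) character offsets
         assert sentence[start:end] == entity.word
         print(f"{entity.ner:10s} {entity.word}")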

NLP Tools Performance
^^^^^^^^^^^^^^^^^^^^^

| The following is a performance comparison between our tools and other tools.
| 以下是我們的工具與其他的工具之性能比較。

CKIP Transformers vs. Monpa & Jieba
""""""""""""""""""""""""""""""""""""

========================  ===========  =============  ===============  ============
Tool                        WS (F1)      POS (Acc)      WS+POS (F1)      NER (F1)
========================  ===========  =============  ===============  ============
CKIP BERT Base            **97.60%**   **95.67%**     **94.19%**       **81.18%**
CKIP ALBERT Base            97.33%       95.30%         93.52%           79.47%
CKIP BERT Tiny              96.98%       95.08%         93.13%           74.20%
CKIP ALBERT Tiny            96.66%       94.48%         92.25%           71.17%
------------------------  -----------  -------------  ---------------  ------------
Monpa†                      92.58%       --             83.88%           --
Jieba                       81.18%       --             --               --
========================  ===========  =============  ===============  ============

| † Monpa provides only 3 types of tags in NER.
| † Monpa 的實體辨識僅提供三種標記而已。

CKIP Transformers vs. CkipTagger
""""""""""""""""""""""""""""""""""""

| The following results are tested on a different dataset.†
| 以下實驗在另一個資料集測試。†

========================  ===========  =============  ===============  ============
Tool                        WS (F1)      POS (Acc)      WS+POS (F1)      NER (F1)
========================  ===========  =============  ===============  ============
CKIP BERT Base            **97.84%**     96.46%       **94.91%**       **79.20%**
CkipTagger                  97.33%     **97.20%**       94.75%           77.87%
========================  ===========  =============  ===============  ============

| † Here we retrained/tested our BERT model on the same dataset as CkipTagger.
| † 我們重新訓練/測試我們的 BERT 模型於跟 CkipTagger 相同的資料集。

License
-------

|GPL-3.0|

Copyright (c) 2023 `CKIP Lab <https://ckip.iis.sinica.edu.tw>`__ under the `GPL-3.0 License <https://www.gnu.org/licenses/gpl-3.0.html>`__.

.. |GPL-3.0| image:: https://www.gnu.org/graphics/gplv3-with-text-136x68.png
   :target: https://www.gnu.org/licenses/gpl-3.0.html



            
