kwja

- Name: kwja
- Version: 2.4.0
- Summary: A unified Japanese analyzer based on foundation models
- Home page: https://github.com/ku-nlp/kwja
- Author / Maintainer: Hirokazu Kiyomaru
- Requires Python: >=3.8, <3.13
- License: MIT
- Keywords: NLP, Japanese
- Uploaded: 2024-08-06 14:29:11

# KWJA: Kyoto-Waseda Japanese Analyzer[^1]

[^1]: Pronunciation: [/kuʒa/](http://ipa-reader.xyz/?text=%20ku%CA%92a)

[![test](https://github.com/ku-nlp/kwja/actions/workflows/test.yml/badge.svg)](https://github.com/ku-nlp/kwja/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/ku-nlp/kwja/branch/main/graph/badge.svg?token=A9FWWPLITO)](https://codecov.io/gh/ku-nlp/kwja)
[![CodeFactor Grade](https://img.shields.io/codefactor/grade/github/ku-nlp/kwja)](https://www.codefactor.io/repository/github/ku-nlp/kwja)
[![PyPI](https://img.shields.io/pypi/v/kwja)](https://pypi.org/project/kwja/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/kwja)

[[Paper]](https://ipsj.ixsq.nii.ac.jp/ej/?action=pages_view_main&active_action=repository_view_main_item_detail&item_id=220232&item_no=1&page_id=13&block_id=8)
[[Slides]](https://speakerdeck.com/nobug/kyoto-waseda-japanese-analyzer)

KWJA is an integrated Japanese text analyzer based on foundation models.
KWJA performs many text analysis tasks, including:
- Typo correction
- Sentence segmentation
- Word segmentation
- Word normalization
- Morphological analysis
- Word feature tagging
- Base phrase feature tagging
- NER (Named Entity Recognition)
- Dependency parsing
- Predicate-argument structure (PAS) analysis
- Bridging reference resolution
- Coreference resolution
- Discourse relation analysis

## Requirements

- Python: 3.8 or later (< 3.13)
- Dependencies: See [pyproject.toml](./pyproject.toml).
- GPUs with CUDA 11.7 (optional)

## Getting Started

Install KWJA with pip:

```shell
$ pip install kwja
```

Perform language analysis with the `kwja` command (the result is in the KNP format):

```shell
# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze text files and write the result to a file
$ kwja --filename path/to/file1.txt --filename path/to/file2.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD
```

If you use Windows and PowerShell, you need to set the `PYTHONUTF8` environment variable to `1`:

```shell
> $env:PYTHONUTF8 = "1"
> kwja ...
```

The output is in the KNP format, which looks like the following:

```
# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <rel type="=" target="ツール" sid="202210011918-0-0" id="5"/><体言><NE:ARTIFACT:KWJA>
KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
```

Here are the options for the `kwja` command:

- `--text`: Text to be analyzed.

- `--filename`: Path to a text file to be analyzed. You can specify this option multiple times.

- `--model-size`: Model size to be used. Specify one of `tiny`, `base` (default), or `large`.

- `--device`: Device to be used. Specify `cpu` or `gpu`. If not specified, the device is automatically selected.

- `--typo-batch-size`: Batch size for typo module.

- `--senter-batch-size`: Batch size for senter module.

- `--seq2seq-batch-size`: Batch size for seq2seq module.

- `--char-batch-size`: Batch size for character module.

- `--word-batch-size`: Batch size for word module.

- `--tasks`: Tasks to be performed. Specify one or more of the following values separated by commas:
  - `typo`: Typo correction
  - `senter`: Sentence segmentation
  - `seq2seq`: Word segmentation, word normalization, reading prediction, lemmatization, and canonicalization
  - `char`: Word segmentation and word normalization
  - `word`: Morphological analysis, named entity recognition, word feature tagging, dependency parsing, PAS analysis, bridging reference resolution, and coreference resolution

- `--config-file`: Path to a custom configuration file.
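
For instance, the following Python sketch (option values are illustrative and the file paths come from the examples above) invokes `kwja` as a subprocess with several of these options and writes the KNP output to a file:

```python
import subprocess

# Run kwja on two input files with an explicit model size, device, and task list,
# then save the KNP-formatted output. Option values are illustrative.
with open("path/to/analyzed.knp", "w", encoding="utf-8") as out:
    subprocess.run(
        [
            "kwja",
            "--model-size", "base",
            "--device", "cpu",
            "--tasks", "typo,senter,char,word",
            "--filename", "path/to/file1.txt",
            "--filename", "path/to/file2.txt",
        ],
        stdout=out,
        check=True,  # raise CalledProcessError if kwja exits with a non-zero status
    )
```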

You can read a KNP format file with [rhoknp](https://github.com/ku-nlp/rhoknp).

```python
from rhoknp import Document
with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```
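
Once parsed, the `Document` object gives access to the analysis results. Below is a minimal sketch that assumes rhoknp's usual unit accessors (`sentences`, `morphemes`, and morpheme attributes such as `text`, `reading`, and `lemma`); check the rhoknp documentation for the exact API:

```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())

# Walk the document and print basic morphological information
# (attribute names are assumed; see the rhoknp documentation).
for sentence in parsed_document.sentences:
    for morpheme in sentence.morphemes:
        print(morpheme.text, morpheme.reading, morpheme.lemma)
```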

For more details about the KNP format, see [Reference](#reference).

## Usage from Python

Make sure the `kwja` command is in your path:

```shell
$ which kwja
/path/to/kwja
```

Install [rhoknp](https://github.com/ku-nlp/rhoknp):

```shell
$ pip install rhoknp
```

Perform language analysis with a `KWJA` instance:

```python
from rhoknp import KWJA
kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
```
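
If your installed rhoknp version's `KWJA` wrapper accepts an `options` list, as rhoknp's other processor wrappers do (this is an assumption; verify it against the rhoknp documentation), you can forward CLI flags such as `--model-size` and `--tasks` through it:

```python
from rhoknp import KWJA

# The `options` argument is assumed here; confirm it against the rhoknp
# documentation for your installed version before relying on it.
kwja = KWJA(options=["--model-size", "base", "--tasks", "senter,char,word"])
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
```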

## Configuration

`kwja` can be configured with a configuration file that sets the default options.
See [Config file example](#config-file-example) for the available settings.

### Config file location

On non-Windows systems, `kwja` follows the
[XDG Base Directory Specification](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html)
for the location of its configuration file.
The configuration directory is named `kwja`, and within it `kwja` looks for a file named `config.yaml`.
For most people it is enough to put the config file at `~/.config/kwja/config.yaml`.
You can also point to a configuration file in a non-standard location with the `KWJA_CONFIG_FILE` environment variable or the `--config-file` command line option.
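
As a sketch of that last point, either mechanism can be used to point `kwja` at a config file outside the default location (the path below is illustrative):

```python
import os
import subprocess

config_path = "/path/to/my/kwja-config.yaml"  # illustrative path

# Option 1: point kwja at the config file via the KWJA_CONFIG_FILE environment variable.
env = dict(os.environ, KWJA_CONFIG_FILE=config_path)
subprocess.run(["kwja", "--text", "こんにちは。"], env=env, check=True)

# Option 2: pass the config file explicitly with --config-file.
subprocess.run(["kwja", "--config-file", config_path, "--text", "こんにちは。"], check=True)
```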

### Config file example

```yaml
model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
senter_batch_size: 1
seq2seq_batch_size: 1
char_batch_size: 1
word_batch_size: 1
```

## Performance Table

- typo, senter, character, and word modules
  - The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
  - We set the learning rate of RoBERTa<sub>LARGE</sub> (word) to 2e-5 because we failed to fine-tune it with a higher learning rate.
    Other hyperparameters are the same as those described in the configs, which are tuned for DeBERTa<sub>BASE</sub>.
- seq2seq module
  - The performance on each task is the mean over all the corpora (KC, KWDLC, Fuman, and WAC).
    - \* denotes results of a single run
  - Scores are calculated with a [script](https://github.com/ku-nlp/kwja/blob/main/scripts/view_seq2seq_results.py) separate from the one used for the character and word modules.

<table>
  <thead>
    <tr>
      <th rowspan="2" colspan="2">Task</th>
      <th colspan="6">Model</th>
    </tr>
    <tr>
      <th>
        v1.x base<br>
        (
            <a href="https://huggingface.co/ku-nlp/roberta-base-japanese-char-wwm">char</a>,
            <a href="https://huggingface.co/nlp-waseda/roberta-base-japanese">word</a>
        )
      </th>
      <th>
        v2.x base<br>
        (
            <a href="https://huggingface.co/ku-nlp/deberta-v2-base-japanese-char-wwm">char</a>,
            <a href="https://huggingface.co/ku-nlp/deberta-v2-base-japanese">word</a> /
            <a href="https://huggingface.co/retrieva-jp/t5-base-long">seq2seq</a>
        )
      </th>
      <th>
        v1.x large<br>
        (
            <a href="https://huggingface.co/ku-nlp/roberta-large-japanese-char-wwm">char</a>,
            <a href="https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512">word</a>
        )
      </th>
      <th>
        v2.x large<br>
        (
            <a href="https://huggingface.co/ku-nlp/deberta-v2-large-japanese-char-wwm">char</a>,
            <a href="https://huggingface.co/ku-nlp/deberta-v2-large-japanese">word</a> /
            <a href="https://huggingface.co/retrieva-jp/t5-large-long">seq2seq</a>
        )
      </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th colspan="2">Typo Correction</th>
      <td>79.0</td>
      <td>76.7</td>
      <td>80.8</td>
      <td>83.1</td>
    </tr>
    <tr>
      <th colspan="2">Sentence Segmentation</th>
      <td>-</td>
      <td>98.4</td>
      <td>-</td>
      <td>98.6</td>
    </tr>
    <tr>
      <th colspan="2">Word Segmentation</th>
      <td>98.5</td>
      <td>98.1 / 98.2*</td>
      <td>98.7</td>
      <td>98.4 / 98.4*</td>
    </tr>
    <tr>
      <th colspan="2">Word Normalization</th>
      <td>44.0</td>
      <td>15.3</td>
      <td>39.8</td>
      <td>48.6</td>
    </tr>
    <tr>
      <th rowspan="7">Morphological Analysis</th>
      <th>POS</th>
      <td>99.3</td>
      <td>99.4</td>
      <td>99.3</td>
      <td>99.4</td>
    </tr>
    <tr>
      <th>sub-POS</th>
      <td>98.1</td>
      <td>98.5</td>
      <td>98.2</td>
      <td>98.5</td>
    </tr>
    <tr>
      <th>conjtype</th>
      <td>99.4</td>
      <td>99.6</td>
      <td>99.2</td>
      <td>99.6</td>
    </tr>
    <tr>
      <th>conjform</th>
      <td>99.5</td>
      <td>99.7</td>
      <td>99.4</td>
      <td>99.7</td>
    </tr>
    <tr>
      <th>reading</th>
      <td>95.5</td>
      <td>95.4 / 96.2*</td>
      <td>90.8</td>
      <td>95.6 / 96.8*</td>
    </tr>
    <tr>
      <th>lemma</th>
      <td>-</td>
      <td>- / 97.8*</td>
      <td>-</td>
      <td>- / 98.1*</td>
    </tr>
    <tr>
      <th>canon</th>
      <td>-</td>
      <td>- / 95.2*</td>
      <td>-</td>
      <td>- / 95.9*</td>
    </tr>
    <tr>
      <th colspan="2">Named Entity Recognition</th>
      <td>83.0</td>
      <td>84.6</td>
      <td>82.1</td>
      <td>85.9</td>
    </tr>
    <tr>
      <th rowspan="2">Linguistic Feature Tagging</th>
      <th>word</th>
      <td>98.3</td>
      <td>98.6</td>
      <td>98.5</td>
      <td>98.6</td>
    </tr>
    <tr>
      <th>base phrase</th>
      <td>86.6</td>
      <td>93.6</td>
      <td>86.4</td>
      <td>93.4</td>
    </tr>
    <tr>
      <th colspan="2">Dependency Parsing</th>
      <td>92.9</td>
      <td>93.5</td>
      <td>93.8</td>
      <td>93.6</td>
    </tr>
    <tr>
      <th colspan="2">Pas Analysis</th>
      <td>74.2</td>
      <td>76.9</td>
      <td>75.3</td>
      <td>77.5</td>
    </tr>
    <tr>
      <th colspan="2">Bridging Reference Resolution</th>
      <td>66.5</td>
      <td>67.3</td>
      <td>65.2</td>
      <td>67.5</td>
    </tr>
    <tr>
      <th colspan="2">Coreference Resolution</th>
      <td>74.9</td>
      <td>78.6</td>
      <td>75.9</td>
      <td>79.2</td>
    </tr>
    <tr>
      <th colspan="2">Discourse Relation Analysis</th>
      <td>42.2</td>
      <td>39.2</td>
      <td>41.3</td>
      <td>44.3</td>
    </tr>
  </tbody>
</table>

## Citation

```bibtex
@InProceedings{Ueda2023a,
  author    = {Nobuhiro Ueda and Kazumasa Omura and Takashi Kodama and Hirokazu Kiyomaru and Yugo Murawaki and Daisuke Kawahara and Sadao Kurohashi},
  title     = {KWJA: A Unified Japanese Analyzer Based on Foundation Models},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year      = {2023},
  address   = {Toronto, Canada},
}
```

```bibtex
@InProceedings{植田2022,
  author    = {植田 暢大 and 大村 和正 and 児玉 貴志 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {KWJA:汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}
```

```bibtex
@InProceedings{児玉2023,
  author    = {児玉 貴志 and 植田 暢大 and 大村 和正 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {テキスト生成モデルによる日本語形態素解析},
  booktitle = {言語処理学会 第29回年次大会},
  year      = {2023},
  address   = {沖縄},
}
```

## License

This software is released under the MIT License; see [LICENSE](LICENSE).

## Reference

- [KNP format](http://cr.fvcrc.i.nagoya-u.ac.jp/~sasano/knp/format.html)

            
