nagisa-bert

Name: nagisa-bert
Version: 0.0.4
Home page: https://github.com/taishi-i/nagisa_bert
Summary: A BERT model for nagisa
Upload time: 2023-12-23 07:24:39
Maintainer: taishi-i
Author: taishi-i
Requires Python: >=3.7
License: MIT
Keywords: nlp, bert, transformers, japanese
Requirements: No requirements were recorded.
# nagisa_bert

[![Python package](https://github.com/taishi-i/nagisa_bert/actions/workflows/python-package.yml/badge.svg)](https://github.com/taishi-i/nagisa_bert/actions/workflows/python-package.yml)
[![PyPI version](https://badge.fury.io/py/nagisa_bert.svg)](https://badge.fury.io/py/nagisa_bert)

This library provides a tokenizer to use [a Japanese BERT model](https://huggingface.co/taishi-i/nagisa_bert) for [nagisa](https://github.com/taishi-i/nagisa).
The model is available in [Transformers](https://github.com/huggingface/transformers) 🤗.

You can try fill-mask using nagisa_bert at [Hugging Face Space](https://huggingface.co/spaces/taishi-i/nagisa_bert-fill_mask).


## Install

Python 3.7+ on Linux or macOS is required.
You can install *nagisa_bert* with *pip*:


```bash
$ pip install nagisa_bert
```
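
If the install succeeded, importing the tokenizer class should work. This is only a quick sanity check and does not download any model weights:

```python
# Minimal import check: succeeds if nagisa_bert is installed correctly.
from nagisa_bert import NagisaBertTokenizer

print(NagisaBertTokenizer)
```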

## Usage

This model can be used with the Transformers `pipeline` function.

```python
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
```

```python
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる ヒデル です',
  'token': 8092,
  'token_str': 'δ½Ώ 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で εˆ©η”¨ できる ヒデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる ヒデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で θ³Όε…₯ できる ヒデル です',
  'token': 9430,
  'token_str': 'θ³Ό ε…₯'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で ε…₯手 できる ヒデル です',
  'token': 11273,
  'token_str': 'ε…₯ 手'}]
```
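
The fill-mask pipeline returns the five highest-scoring candidates by default. On recent versions of Transformers, the standard `top_k` argument of the fill-mask pipeline (a generic pipeline option, not something specific to nagisa_bert) limits the number of candidates; a minimal sketch:

```python
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model="taishi-i/nagisa_bert", tokenizer=tokenizer)

# top_k limits the number of returned candidates (3 here instead of the default 5).
for p in fill_mask("nagisaで[MASK]できるヒデルです", top_k=3):
    print(f"{p['score']:.4f}\t{p['sequence']}")
```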

Tokenization and vectorization with `NagisaBertTokenizer` and `BertModel`:

```python
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるヒデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'ヒデル', 'です']

model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
```

```python
tensor([[[-0.2912, -0.6818, -0.4097,  ...,  0.0262, -0.3845,  0.5816],
         [ 0.2504,  0.2143,  0.5809,  ..., -0.5428,  1.1805,  1.8701],
         [ 0.1890, -0.5816, -0.5469,  ..., -1.2081, -0.2341,  1.0215],
         ...,
         [-0.4360, -0.2546, -0.2824,  ...,  0.7420, -0.2904,  0.3070],
         [-0.6598, -0.7607,  0.0034,  ...,  0.2982,  0.5126,  1.1403],
         [-0.2505, -0.6574, -0.0523,  ...,  0.9082,  0.5851,  1.2625]]],
       grad_fn=<NativeLayerNormBackward0>)
```
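
One common way to turn these token-level vectors into a single sentence vector is mean pooling over the attention mask. This is a generic recipe, not something prescribed by nagisa_bert; a minimal sketch:

```python
import torch
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
model = BertModel.from_pretrained("taishi-i/nagisa_bert")

inputs = tokenizer("nagisaで[MASK]できるヒデルです", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state         # (1, seq_len, 768)

# Mean-pool over tokens, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
sentence_vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vec.shape)  # torch.Size([1, 768])
```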

## Tutorial

Here is a list of notebooks on Japanese NLP using pre-trained models and Transformers. A minimal fine-tuning sketch follows the table.


| Notebook     |      Description      | Colab |
|:----------|:-------------|------:|
| [Fill-mask](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)  | How to use the pipeline function in transformers to fill in Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)|
| [Feature-extraction](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)  | How to use the pipeline function in transformers to extract features from Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)|
| [Embedding visualization](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization-japanese_bert_models.ipynb)  | Show how to visualize embeddings from Japanese pre-trained models. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization_japanese_bert_models.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)  | Show how to fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)|
| [How to fine-tune a model on text classification with csv files](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)  | Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)|
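
As a rough, self-contained sketch of what the fine-tuning notebooks set up, the snippet below loads the model with a classification head and computes a training loss on two placeholder examples. The texts, labels, and 2-class setup are illustrative, not taken from the notebooks:

```python
import torch
from transformers import BertForSequenceClassification
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
# num_labels=2 is a placeholder; use the label count of your own dataset.
model = BertForSequenceClassification.from_pretrained("taishi-i/nagisa_bert", num_labels=2)

texts = ["とても良い商品でした", "すぐに壊れてしまいました"]  # placeholder review texts
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)  # this loss is what fine-tuning minimizes
```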


## Model description

### Architecture

The model architecture is the same as [the BERT bert-base-uncased architecture](https://huggingface.co/bert-base-uncased) (12 layers, 768 dimensions of hidden states, and 12 attention heads).
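
These numbers can be checked directly from the model's configuration (the attributes below are standard `transformers.BertConfig` fields):

```python
from transformers import AutoConfig

# Load the configuration of the pretrained model from the Hugging Face Hub.
config = AutoConfig.from_pretrained("taishi-i/nagisa_bert")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```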

### Training Data

The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 8, 2022 with [make_corpus_wiki.py](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) and [create_pretraining_data.py](https://github.com/cl-tohoku/bert-japanese/blob/main/create_pretraining_data.py).

### Training

The model is trained with the default parameters of [transformers.BertConfig](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertConfig).
Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, trained for 2M steps.



            
