# nagisa_bert
[![Python package](https://github.com/taishi-i/nagisa_bert/actions/workflows/python-package.yml/badge.svg)](https://github.com/taishi-i/nagisa_bert/actions/workflows/python-package.yml)
[![PyPI version](https://badge.fury.io/py/nagisa_bert.svg)](https://badge.fury.io/py/nagisa_bert)
This library provides a tokenizer to use [a Japanese BERT model](https://huggingface.co/taishi-i/nagisa_bert) for [nagisa](https://github.com/taishi-i/nagisa).
The model is available in [Transformers](https://github.com/huggingface/transformers) 🤗.
You can try fill-mask using nagisa_bert at [Hugging Face Space](https://huggingface.co/spaces/taishi-i/nagisa_bert-fill_mask).
## Install
Python 3.7+ on Linux or macOS is required.
You can install *nagisa_bert* with the *pip* command.
```bash
$ pip install nagisa_bert
```
## Usage
This model can be used with the Transformers `pipeline` function.
```python
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer
text = "nagisaγ§[MASK]γ§γγγ’γγ«γ§γ"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
```
```python
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる モデル です',
  'token': 8092,
  'token_str': '使 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で 利用 できる モデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる モデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で 購入 できる モデル です',
  'token': 9430,
  'token_str': '購 入'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で 入手 できる モデル です',
  'token': 11273,
  'token_str': '入 手'}]
```
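Each result is a dict with `score`, `sequence`, `token`, and `token_str`, so the top candidate can be read off directly. A minimal sketch continuing from the snippet above (`top_k` is a standard fill-mask pipeline option, not specific to this library):

```python
# Keep only the highest-scoring candidate for the [MASK] position.
results = fill_mask(text, top_k=1)
best = results[0]
print(best["token_str"], best["score"])  # e.g. 使 用 0.1385...
```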
The tokenizer and model can also be used directly for tokenization and vectorization.
```python
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer
text = "nagisaγ§[MASK]γ§γγγ’γγ«γ§γ"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'モデル', 'です']
model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
```
```python
tensor([[[-0.2912, -0.6818, -0.4097, ..., 0.0262, -0.3845, 0.5816],
[ 0.2504, 0.2143, 0.5809, ..., -0.5428, 1.1805, 1.8701],
[ 0.1890, -0.5816, -0.5469, ..., -1.2081, -0.2341, 1.0215],
...,
[-0.4360, -0.2546, -0.2824, ..., 0.7420, -0.2904, 0.3070],
[-0.6598, -0.7607, 0.0034, ..., 0.2982, 0.5126, 1.1403],
[-0.2505, -0.6574, -0.0523, ..., 0.9082, 0.5851, 1.2625]]],
grad_fn=<NativeLayerNormBackward0>)
```
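The last hidden state is a per-token representation. If you need a single fixed-size vector per sentence (for example, for the feature-extraction use case in the tutorials below), mean pooling over non-padding tokens is one common option. A minimal sketch, not part of the library itself:

```python
import torch

from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
model = BertModel.from_pretrained("taishi-i/nagisa_bert")

text = "nagisaで[MASK]できるモデルです"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Zero out padding positions, then average over the sequence dimension.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
sentence_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```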
## Tutorial
Below is a list of notebooks on Japanese NLP using pre-trained models and Transformers; a minimal fine-tuning sketch follows the table.
| Notebook | Description | |
|:----------|:-------------|------:|
| [Fill-mask](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb) | How to use the pipeline function in transformers to fill in Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/fill_mask-japanese_bert_models.ipynb)|
| [Feature-extraction](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb) | How to use the pipeline function in transformers to extract features from Japanese text. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/feature_extraction-japanese_bert_models.ipynb)|
| [Embedding visualization](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization-japanese_bert_models.ipynb) | Show how to visualize embeddings from Japanese pre-trained models. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/embedding_visualization-japanese_bert_models.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb) | Show how to fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-amazon_reviews_ja.ipynb)|
| [How to fine-tune a model on text classification with csv files](https://github.com/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task. |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taishi-i/nagisa_bert/blob/develop/notebooks/text_classification-csv_files.ipynb)|
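The fine-tuning notebooks build on the standard Transformers `Trainer` API. As a rough sketch of the setup (the label count, example texts, and training arguments below are placeholders, not the notebooks' exact configuration):

```python
from datasets import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
model = BertForSequenceClassification.from_pretrained(
    "taishi-i/nagisa_bert", num_labels=2  # num_labels is a placeholder
)

# Toy dataset; replace with your own texts and labels.
dataset = Dataset.from_dict(
    {"text": ["とても良い商品です", "二度と買いません"], "label": [1, 0]}
)
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out", num_train_epochs=1, per_device_train_batch_size=8
    ),
    train_dataset=dataset,
)
trainer.train()
```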
## Model description
### Architecture
The model architecture is the same as that of [bert-base-uncased](https://huggingface.co/bert-base-uncased) (12 layers, 768-dimensional hidden states, and 12 attention heads).
### Training Data
The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 8, 2022, using [make_corpus_wiki.py](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) and [create_pretraining_data.py](https://github.com/cl-tohoku/bert-japanese/blob/main/create_pretraining_data.py).
### Training
The model is trained with the default parameters of [transformers.BertConfig](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertConfig).
Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, with 2M training steps.
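For reference, the `BertConfig` defaults correspond to the architecture described above; a quick check (assuming a recent Transformers version):

```python
from transformers import BertConfig

# The default BertConfig matches bert-base: 12 layers, 768 hidden units, 12 heads.
config = BertConfig()
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# 12 768 12
```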