# ThaiXtransformers
<a target="_blank" href="https://colab.research.google.com/github/PyThaiNLP/thaixtransformers/blob/main/notebooks/wangchanberta_getting_started_aireseach.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
**Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.**
Forked from [vistec-AI/thai2transformers](https://github.com/vistec-AI/thai2transformers).
This project provides the tokenizer and data preprocessing for the RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.
Paper: [WangchanBERTa: Pretraining transformer-based Thai Language Models](https://arxiv.org/abs/2101.09635)
## Install
> pip install thaixtransformers
## Usage
### Tokenizer
> from thaixtransformers import Tokenizer
To use a model, load its tokenizer by model name.
> Tokenizer(model_name) -> Tokenizer
**Example**
```python
from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM
tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")
classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("ผมชอบ<mask>มาก ๆ"))
# output:
# [{'score': 0.05261131376028061,
# 'token': 6052,
# 'token_str': 'อินเทอร์เน็ต',
# 'sequence': 'ผมชอบอินเทอร์เน็ตมากๆ'},
# {'score': 0.03980604186654091,
# 'token': 11893,
# 'token_str': 'อ่านหนังสือ',
# 'sequence': 'ผมชอบอ่านหนังสือมากๆ'},
# ...]
```
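
Since the object returned by `Tokenizer(model_name)` is used as a drop-in tokenizer for a `transformers` pipeline above, you can also call it directly to encode and decode text. A minimal sketch, assuming the returned object follows the usual Hugging Face `PreTrainedTokenizer` interface (the actual token IDs depend on the model's vocabulary):

```python
from thaixtransformers import Tokenizer

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")

# Encode a sentence to token IDs, then decode it back to text.
encoded = tokenizer("ผมชอบอ่านหนังสือ")
print(encoded["input_ids"])                    # e.g. [0, ..., 2]; IDs vary by vocab
print(tokenizer.decode(encoded["input_ids"]))
```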
### Preprocess
If you want to preprocess data before training a model, you can use `process_transformers`.
> from thaixtransformers.preprocess import process_transformers
> process_transformers(str) -> str
**Example**
```python
from thaixtransformers.preprocess import process_transformers
print(process_transformers("สวัสดี :D"))
# output: 'สวัสดี<_>:d'
```
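
In a training workflow, you would typically clean raw text with `process_transformers` first and then tokenize the result. A minimal sketch, assuming `Tokenizer` returns a standard Hugging Face tokenizer as above; the sample strings are arbitrary illustrations:

```python
from thaixtransformers import Tokenizer
from thaixtransformers.preprocess import process_transformers

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")

# Hypothetical raw training texts.
texts = ["สวัสดีครับ :D", "ผมชอบอ่านหนังสือมาก ๆ"]

# Clean each text, then tokenize the cleaned batch.
cleaned = [process_transformers(t) for t in texts]
batch = tokenizer(cleaned, padding=True, truncation=True)
print(batch["input_ids"])
```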
## BibTeX entry and citation info
```
@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```