# Data Augmentation for Chinese Text (Python 3)
## Usage
### Two functions are provided for Chinese text data augmentation
### Install textda
pip install:
```bash
pip install textda
```
1. Expand data with **data_expansion**:
```python
from textda.data_expansion import *
print(data_expansion('生活里的惬意,无需等到春暖花开'))
```
output:
```python
['生活里面的惬意,无需等到春暖花开',
'生活里的等到春暖花开',
'生活里无需惬意,的等到春暖花开',
'生活里的惬意,无需等到春暖花开',
'生活里的惬意,并不需要等到春暖花开',
'生活无需的惬意,里等到春暖花开',
'生活里的惬意,等到无需春暖花开']
```
Parameter explanation:

    :param sentence: input sentence text
    :param alpha_sr: synonym replacement control parameter; a bigger value means more words are replaced
    :param alpha_ri: random insertion; a bigger value means more words are inserted
    :param alpha_rs: random swap; a bigger value means more words are swapped
    :param p_rd: random deletion; a bigger value means more words are deleted
    :param num_aug: how many times each method is repeated

- You can use the parameters alpha_sr, alpha_ri, alpha_rs, p_rd, and num_aug to control the output.

  If you set both alpha_ri and alpha_rs to 0, no word positions are changed, which suits a **linear classifier** that is insensitive to word location,
like this:
```python
from textda.data_expansion import *
print(data_expansion('生活里的惬意,无需等到春暖花开', alpha_ri=0, alpha_rs=0))
```
output:
```python
['生活里的惬意,无需等到春暖花开',
',无需春暖花开',
'生活里面的惬意,无需等到春暖花开',
'生活里的惬意,需等到春暖花开']
```
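The alpha parameters follow the EDA convention: for a sentence of n words, roughly alpha × n words are affected by each operation. As a minimal illustration (independent of textda's internals, and only a sketch of the idea), random deletion (the `p_rd` operation) can be written as:

```python
import random

def random_deletion(words, p_rd, seed=None):
    """EDA-style random deletion: drop each word independently
    with probability p_rd, but always keep at least one word."""
    rng = random.Random(seed)
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p_rd]
    # If every word was deleted, fall back to one random word.
    return kept if kept else [rng.choice(words)]

print(random_deletion(list("生活里的惬意"), p_rd=0.3, seed=42))
```

A larger `p_rd` removes more words per augmented sentence, which is exactly the "bigger means more words are deleted" behavior described above.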
2. Use **translate_batch** like this:
```python
from textda.youdao_translate import *
data_dir = './data'
translate_batch(os.path.join(data_dir, 'insurance_train'), batch_num=30)
```
```
# Translation results: Chinese -> English -> Chinese (back-translation)
颜色碰掉了一个角不延迟,但事情或他们不赠送,或发送,眉笔打开已经破碎,磨山楂,也不打破一只手,轻轻刷掉,持久性不长,
这个用户没有填写评价内容
颜色非常不喜欢它
不说话,缓慢的新领域
不太容易染好骑吗
不是很好我喜欢!
没有颜色的眼影
应该有大礼物盒眼影,礼物不礼物盒,没有一起破碎粉碎好的眼影不买礼物清洁剂脏就像商品是压力
没有生产日期,我不知道是否真实,总是觉得有点奇怪
是一个小飞粉吗
但是一些混合的颜色
有几次,现在这个东西,笔是空的
眼影有点小,少一点。
不好的颜色,粉红色
明星不想买,坏了,不容易,不要在乎太多!
一开始我已经联系快递,快递一直拖,说他将返回将联系快递服务
画不是,是不好的
物理和照片有很大的区别
不要把眼影刷不是很方便
感觉好干,颜色更暗
打破了在运输途中,有点太脆弱…
盒子有点坏了,还没有发送。
```
Parameter explanation:

    :param file_path: source file path
    :param batch_num: default 30
    :param reWrite: default True, which overwrites the file; False appends data to the file.
    :param suffix: suffix for the new file
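The translated sentences above come from back-translation: each Chinese sentence is translated to English and back to Chinese, producing a paraphrase. A minimal sketch of the idea follows; `translate` here is a hypothetical callable standing in for a real translation API (textda uses the Youdao service for this), and the toy table exists only to make the example self-contained:

```python
def back_translate(sentence, translate):
    """Back-translation augmentation: zh -> en -> zh.

    `translate` is assumed to be (text, src_lang, dst_lang) -> str.
    """
    english = translate(sentence, "zh", "en")
    return translate(english, "en", "zh")

# Toy stand-in translator for demonstration only (not a real API).
toy_table = {("zh", "en"): {"你好": "hello"},
             ("en", "zh"): {"hello": "你好"}}
translate = lambda text, src, dst: toy_table[(src, dst)].get(text, text)
print(back_translate("你好", translate))
```

With a real translation backend the round trip is lossy, which is the point: the returned sentence is a natural-sounding paraphrase of the input.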
## Reference
https://github.com/jasonwei20/eda_nlp
Code for the ICLR 2019 Workshop paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196
## License
[MIT](./LICENSE)