textda


- Name: textda
- Version: 0.1.0.6
- Home page: https://github.com/wac81/textda
- Summary: data augmentation for Chinese text
- Upload time: 2019-08-11 06:53:07
- Author: wac
- License: MIT
- Keywords: classification, expansion, augmentation, addition, data, text, chinese
- Requirements: none recorded

# Data Augmentation for Chinese Text in Python 3

## Usage
### textda provides two functions for Chinese text data augmentation

### Install textda
Install with pip:

```bash
pip install textda
```

1. Expand your data with **data_expansion**:
```python
from textda.data_expansion import *
print(data_expansion('生活里的惬意,无需等到春暖花开')) 

```
output:

```python
['生活里面的惬意,无需等到春暖花开', 
'生活里的等到春暖花开',
'生活里无需惬意,的等到春暖花开', 
'生活里的惬意,无需等到春暖花开', 
'生活里的惬意,并不需要等到春暖花开', 
'生活无需的惬意,里等到春暖花开', 
'生活里的惬意,等到无需春暖花开']

```
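Note that the returned list can contain the original sentence itself, as the output above shows. A minimal sketch of how such output might be folded into a labeled training set follows. The helper `expand_labeled` and the stand-in `fake_augment` are hypothetical names introduced here for illustration, not part of textda; in practice you would pass `data_expansion` as the `augment` argument.

```python
from typing import Callable, Iterable, List, Tuple


def expand_labeled(
    samples: Iterable[Tuple[str, str]],
    augment: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return (text, label) pairs: each original text plus its unique variants."""
    expanded = []
    for text, label in samples:
        seen = set()
        # Original first, then the augmented variants, skipping repeats.
        for candidate in [text, *augment(text)]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append((candidate, label))
    return expanded


# Hypothetical stand-in augmenter so the sketch runs without textda installed.
def fake_augment(sentence: str) -> List[str]:
    return [sentence, sentence + "!", sentence + "!"]


train = [("生活里的惬意", "positive")]
print(expand_labeled(train, fake_augment))
```

Every variant inherits the label of the sentence it came from, and duplicates are dropped so that repeated variants do not skew the class balance.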

Parameters:

    :param sentence: input sentence text
    :param alpha_sr: synonym replacement control; a larger value replaces more words
    :param alpha_ri: random insertion control; a larger value inserts more words
    :param alpha_rs: random swap control; a larger value swaps more words
    :param p_rd: random deletion control; a larger value deletes more words
    :param num_aug: number of times each method is repeated

- The parameters alpha_sr, alpha_ri, alpha_rs, p_rd, and num_aug control the output.

    Set alpha_ri and alpha_rs to 0 when the augmented data is meant for a **linear classifier**, which is insensitive to word position.

    For example:
  ```python

  from textda.data_expansion import *

  print(data_expansion('生活里的惬意,无需等到春暖花开', alpha_ri=0, alpha_rs=0))

  ```
  output:

  ```python
  ['生活里的惬意,无需等到春暖花开', 
      ',无需春暖花开', 
      '生活里面的惬意,无需等到春暖花开', 
      '生活里的惬意,需等到春暖花开']

  ```
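To make the roles of `alpha_rs` and `p_rd` more concrete, here is a minimal sketch of random swap and random deletion in the style of the EDA reference code. It is not textda's implementation: the helper names (`num_ops`, `random_swap`, `random_delete`) are hypothetical, the sentence is assumed to be already segmented into words, and the rule "a fraction alpha of the words, at least one, and none when alpha is 0" is an assumption modeled on the EDA approach.

```python
import random
from typing import List


def num_ops(alpha: float, num_words: int) -> int:
    """Edits per sentence: none if alpha is 0, otherwise a fraction alpha of the words (at least 1)."""
    if alpha <= 0:
        return 0
    return max(1, int(alpha * num_words))


def random_swap(words: List[str], n: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each word independently with probability p, keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
# Assumed pre-segmented input; real Chinese text needs a word segmenter first.
words = ["生活", "里", "的", "惬意", "无需", "等到", "春暖花开"]
print(random_swap(words, num_ops(0.1, len(words)), rng))
print(random_delete(words, 0.1, rng))
```

Swapping only reorders words, so it changes nothing for a bag-of-words model, which is consistent with the advice above for linear classifiers.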



2. Augment data by back-translation with **translate_batch**:

```python
import os

from textda.youdao_translate import *

# Path to the source file to translate.
data_dir = './data'
translate_batch(os.path.join(data_dir, 'insurance_train'), batch_num=30)

```

```
# Translation results: Chinese -> English, then English -> Chinese

颜色碰掉了一个角不延迟,但事情或他们不赠送,或发送,眉笔打开已经破碎,磨山楂,也不打破一只手,轻轻刷掉,持久性不长,
这个用户没有填写评价内容
颜色非常不喜欢它
不说话,缓慢的新领域
不太容易染好骑吗
不是很好我喜欢!
没有颜色的眼影
应该有大礼物盒眼影,礼物不礼物盒,没有一起破碎粉碎好的眼影不买礼物清洁剂脏就像商品是压力
没有生产日期,我不知道是否真实,总是觉得有点奇怪
是一个小飞粉吗
但是一些混合的颜色
有几次,现在这个东西,笔是空的
眼影有点小,少一点。
不好的颜色,粉红色
明星不想买,坏了,不容易,不要在乎太多!
一开始我已经联系快递,快递一直拖,说他将返回将联系快递服务
画不是,是不好的
物理和照片有很大的区别
不要把眼影刷不是很方便
感觉好干,颜色更暗
打破了在运输途中,有点太脆弱…
盒子有点坏了,还没有发送。

```

Parameters:

    :param file_path: source file path
    :param batch_num: batch size; default 30
    :param reWrite: default True, which rewrites the file; False appends the data to the end of the file
    :param suffix: suffix for the new file
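
The round trip behind this kind of augmentation can be sketched as follows. This is an illustration only, not textda's code: `batches`, `back_translate`, and the `translate` callable are hypothetical names, and it assumes that `batch_num` means the number of lines sent per request. The stand-in translator just tags each line so the sketch runs offline.

```python
from typing import Callable, Iterator, List


def batches(lines: List[str], batch_num: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_num lines."""
    for start in range(0, len(lines), batch_num):
        yield lines[start:start + batch_num]


def back_translate(
    lines: List[str],
    translate: Callable[[List[str], str, str], List[str]],
    batch_num: int = 30,
) -> List[str]:
    """Round-trip each batch zh -> en -> zh through the supplied translate callable."""
    result = []
    for batch in batches(lines, batch_num):
        english = translate(batch, "zh", "en")
        result.extend(translate(english, "en", "zh"))
    return result


# Hypothetical stand-in translator: it only tags each line with the direction.
def fake_translate(texts: List[str], src: str, dst: str) -> List[str]:
    return [f"{t}[{src}->{dst}]" for t in texts]


print(back_translate(["颜色很好", "盒子坏了"], fake_translate, batch_num=1))
```

Passing the translator in as a callable keeps the batching logic independent of any particular translation service.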



## Reference

https://github.com/jasonwei20/eda_nlp

Code for the ICLR 2019 Workshop paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196


## License

[MIT](./LICENSE)



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wac81/textda",
    "name": "textda",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "classification,expansion,augmentation,addition,data,text,chinese",
    "author": "wac",
    "author_email": "wuanch@gmail.com",
    "download_url": "",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "this is data augmentation for chinese text",
    "version": "0.1.0.6",
    "split_keywords": [
        "classification",
        "expansion",
        "augmentation",
        "addition",
        "data",
        "text",
        "chinese"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "8618be2a6192ad5cc0f776ec852d696b",
                "sha256": "28c6baabd9ca539648cb8c8cb68c34bf1dfdfaf4fdeb61638bb6adbd5da2fb34"
            },
            "downloads": -1,
            "filename": "textda-0.1.0.6-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8618be2a6192ad5cc0f776ec852d696b",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 14007,
            "upload_time": "2019-08-11T06:53:07",
            "upload_time_iso_8601": "2019-08-11T06:53:07.165313Z",
            "url": "https://files.pythonhosted.org/packages/b3/ed/091104cd0788ee166ecc8b6e4e90b4360a5397355052725e6d42937d97c4/textda-0.1.0.6-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "2457625e6ba1c9cbc9539788420e335b",
                "sha256": "e3564367c85bd915eede083bcea2537559d209b85c3b1fa5ca6272e800298647"
            },
            "downloads": -1,
            "filename": "textda-0.1.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2457625e6ba1c9cbc9539788420e335b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 13987,
            "upload_time": "2019-05-29T06:59:32",
            "upload_time_iso_8601": "2019-05-29T06:59:32.630271Z",
            "url": "https://files.pythonhosted.org/packages/45/c3/28473db1835202ce6c2f16393273cef29662e84eef662cd108ac82611247/textda-0.1.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2019-08-11 06:53:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "wac81",
    "github_project": "textda",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "textda"
}
        