that-nlp-library


| Field | Value |
|---|---|
| Name | that-nlp-library |
| Version | 0.2.2 |
| Home page | https://github.com/anhquan0412/that-nlp-library |
| Summary | Aim to be a convenient NLP library with the help from HuggingFace |
| Upload time | 2024-05-03 17:43:07 |
| Maintainer | None |
| Docs URL | None |
| Author | Quan Tran |
| Requires Python | >=3.9 |
| License | Apache Software License 2.0 |
| Keywords | nbdev, python, nlp, natural language processing, transformer, deep learning, envibert, roberta, gpt2, phobert |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# Welcome to that-nlp-library


## Install

``` sh
pip install that_nlp_library
```

It is advised that you manually install torch (using the build that
matches your CUDA version if you have a GPU). For example, for CUDA 11.8:

``` sh
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
```

Visit the [PyTorch website](https://pytorch.org/) for more information.
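
After installing, it can help to confirm which torch build is active. The snippet below is a minimal sketch; the helper name `torch_cuda_status` is made up for illustration and is not part of this library.

```python
import importlib.util


def torch_cuda_status() -> str:
    """Report whether torch is installed and whether it can see a CUDA device."""
    # find_spec returns None when the package cannot be found
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return "cuda available" if torch.cuda.is_available() else "cpu only"


print(torch_cuda_status())
```

If this reports `cpu only` on a machine with a GPU, the installed wheel probably does not match your CUDA version.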

# High-Level Overview

## Supervised Learning

For supervised learning, the main pipeline contains 2 parts:

### **Text Data Controller: [`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller) (for text processing)**

Here is the list of processing steps you can use, in the order they are
applied. Any step can be skipped.

![](images/text_processings.PNG)

Here is an example of the Text Controller for a classification task
(predict `Division Name`), without any text preprocessing. The code will
also tokenize your text field.

``` python3
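from transformers import RobertaTokenizer
# Import path assumed from the docs page name (text_main); adjust if it differs
from that_nlp_library.text_main import TextDataController

# Load the CSV and declare the text column and the classification label
tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  label_names='Division Name',
                                  sup_types='classification',
                                 )
# Run the (empty) processing pipeline, then tokenize the text column
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer,max_length=100,shuffle_trn=True)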
```

And here is an example when all processings are applied

``` python3
from functools import partial

from underthesea import text_normalize
import nlpaug.augmenter.char as nac

# define the augmentation function
def nlp_aug(x,aug=None):
    results = aug.augment(x)
    if not isinstance(x,list): return results[0]
    return results
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug,aug=aug)

# initialize the TextDataController
tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  
                                  # metadatas
                                  metadatas='Title',
                                  
                                  # label
                                  label_names='Division Name',
                                  sup_types='classification',
                                  label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates'},
                                  
                                  # row filter
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Division Name': lambda x: x is not None,
                                              },
                                              
                                  # text transformation
                                  content_transformation=[text_normalize,str.lower],
                                  
                                  # validation split
                                  val_ratio=0.2,
                                  stratify_cols=['Division Name'],
                                  
                                  # upsampling
                                  upsampling_list=[('Division Name',lambda x: x=='Intimates')],
                                  
                                  # text augmentation
                                  content_augmentations=nearby_aug_func
                                 )

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer,max_length=100,shuffle_trn=True)
```

For an in-depth tutorial on the Text Controller for supervised learning
([`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller)),
see the
[tutorial page](https://anhquan0412.github.io/that-nlp-library/text_main.html).
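
To make a few of these processing steps concrete, here is a rough plain-Python sketch of what they amount to on a handful of rows. It is not the library's implementation: the helper `apply_processings` and the toy rows are made up for illustration, only the parameter names (`label_tfm_dict`, `filter_dict`, `content_transformation`) mirror the example above, and the step order simply follows the comments in that example.

```python
rows = [
    {"Review Text": "Love This Dress", "Division Name": "General"},
    {"Review Text": None, "Division Name": "General"},
    {"Review Text": "Runs Small", "Division Name": "Initmates"},
]

label_tfm_dict = {"Division Name": lambda x: x if x != "Initmates" else "Intimates"}
filter_dict = {
    "Review Text": lambda x: x is not None,
    "Division Name": lambda x: x is not None,
}
content_transformation = [str.strip, str.lower]


def apply_processings(rows, main_text="Review Text"):
    """Fix labels, drop unwanted rows, then transform the main text."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        # label transformation: repair the misspelled label
        for col, tfm in label_tfm_dict.items():
            row[col] = tfm(row[col])
        # row filter: keep the row only if every column predicate passes
        if not all(keep(row[col]) for col, keep in filter_dict.items()):
            continue
        # text transformations, applied left to right
        for tfm in content_transformation:
            row[main_text] = tfm(row[main_text])
        out.append(row)
    return out


processed = apply_processings(rows)
print(processed)
```

The row with a missing review is dropped before any text transformation runs, which is why the filter has to come before `str.lower` in this sketch.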

This library also provides a **streamed version of Text Controller**
([`TextDataControllerStreaming`](https://anhquan0412.github.io/that-nlp-library/text_main_streaming.html#textdatacontrollerstreaming)),
allowing you to work with data without having it entirely on your hard
drive. It supports every processing step of the non-streamed version
except **Train/Validation split** (so you have to define your validation
set beforehand) and **Upsampling**.
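
The general idea of streamed processing can be illustrated with plain generators. This is only a sketch of lazy, row-at-a-time processing, not how `TextDataControllerStreaming` works internally; `stream_rows`, `lowercase_stream` and the inline CSV text are hypothetical.

```python
import csv
import io

CSV_TEXT = "Review Text,Division Name\nLove It,General\nToo Small,Intimates\n"


def stream_rows(file_obj):
    """Yield one row at a time instead of loading the whole file."""
    yield from csv.DictReader(file_obj)


def lowercase_stream(rows, column):
    """Row-wise transformation: it needs only the current row, so it can be lazy."""
    for row in rows:
        row[column] = row[column].lower()
        yield row


stream = lowercase_stream(stream_rows(io.StringIO(CSV_TEXT)), "Review Text")
first = next(stream)  # only the first row has been read and processed so far
print(first)
```

Row-wise steps fit this pattern naturally. A stratified split or upsampling, by contrast, would need label counts over the whole dataset, which a single pass over a stream does not have up front; that is one plausible reading of why those two steps are excluded.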

For more details on **streaming**, visit [how to create a streamed
dataset](https://anhquan0412.github.io/that-nlp-library/text_main_streaming.html)
and [how to train a model with a streamed
dataset](https://anhquan0412.github.io/that-nlp-library/roberta_singlehead_for_streaming)

If you are curious about the time and space efficiency of the streamed
versus the non-streamed version, see the
[benchmark](https://anhquan0412.github.io/that-nlp-library/text_main_benchmark.html).

### **Model and [`ModelController`](https://anhquan0412.github.io/that-nlp-library/model_main.html#modelcontroller)**

The library can perform the following:

- **Classification ([simple
  tutorial](https://anhquan0412.github.io/that-nlp-library/model_classification_tutorial.html))**

- **[Regression](https://anhquan0412.github.io/that-nlp-library/roberta_multihead_regression.html)**

- **[Multilabel
  classification](https://anhquan0412.github.io/that-nlp-library/roberta_multilabel.html)**

- **[Multiheads](https://anhquan0412.github.io/that-nlp-library/roberta_multihead.html)**,
  where each head can be either classification or regression

  - A “multihead” model predicts multiple outputs at once. For example,
    given a sentence (e.g. a review on an e-commerce site), it predicts
    the category the sentence is about, its sentiment, and possibly its
    rating.

  - This example is a 3-head problem: classification (for category),
    classification (for sentiment), and regression (for a rating from 1
    to 5).

- For 2-head classification where there is a hierarchical relationship
  between the first output and the second output (e.g. the first output
  is the level 1 clothing category, and the second output is the level 2
  clothing subcategory), the library offers two approaches designed for
  this use case: training with [conditional
  probability](https://anhquan0412.github.io/that-nlp-library/roberta_conditional_prob.html),
  or with [deep hierarchical
  classification](https://anhquan0412.github.io/that-nlp-library/roberta_dhc.html)
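
To make the 3-head review example from the list above concrete, the sketch below computes one loss per head for a single review and adds them up. It uses only the standard library; the numbers are invented, and summing the head losses without weights is an assumption of this sketch, not a statement about how the library combines them.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-softmax probability of the target class."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def squared_error(prediction, target):
    """Squared difference, a common regression loss."""
    return (prediction - target) ** 2


# One review, three heads: category (3 classes), sentiment (2 classes), rating (1 to 5)
outputs = {"category": [2.0, 0.5, -1.0], "sentiment": [0.1, 1.2], "rating": 4.3}
targets = {"category": 0, "sentiment": 1, "rating": 5.0}

losses = {
    "category": cross_entropy(outputs["category"], targets["category"]),
    "sentiment": cross_entropy(outputs["sentiment"], targets["sentiment"]),
    "rating": squared_error(outputs["rating"], targets["rating"]),
}
total_loss = sum(losses.values())  # unweighted sum: an assumption of this sketch
print(losses, total_loss)
```

Each classification head gets its own set of logits and its own target, while the regression head produces a single number.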

### Decoupling of Text Controller and Model Controller

In this library, you can use
[`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller)
on its own to handle all the text processing and have the final
processed HuggingFace `DatasetDict` returned to you. Conversely, if you
already have your own processed `DatasetDict`, you can skip the text
controller and use only the
[`ModelController`](https://anhquan0412.github.io/that-nlp-library/model_main.html#modelcontroller)
to train on your data. There’s a quick tutorial on this decoupling
[here](https://anhquan0412.github.io/that-nlp-library/model_classification_tutorial.html#train-model-with-only-a-tokenized-datasetdict-no-textdatacontroller).

## Language Modeling

For language modeling, the main pipeline also contains 2 parts:

### Text Data Controller for Language Model: [`TextDataLMController`](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html#textdatalmcontroller)

Similar to `TextDataController`,
[`TextDataLMController`](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html#textdatalmcontroller)
also provides a list of processing steps (except for **Label Processing**,
**Upsampling** and **Text Augmentation**). The controller also allows
tokenization line-by-line or by token concatenation. See the
[tutorial](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html)
for details.
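
The difference between the two tokenization modes can be shown with a toy whitespace tokenizer. This is a conceptual sketch only: the function names are hypothetical, a real HuggingFace tokenizer would replace `tokenize`, and dropping the incomplete final block is an assumption borrowed from common language-modeling preprocessing rather than something stated in this README.

```python
def tokenize(line):
    """Toy whitespace tokenizer standing in for a real HuggingFace tokenizer."""
    return line.split()


def line_by_line(lines, max_length):
    """Each line becomes its own example, truncated to max_length tokens."""
    return [tokenize(line)[:max_length] for line in lines]


def concatenate_and_chunk(lines, block_size):
    """Join all tokens into one sequence, then cut it into equal blocks."""
    all_tokens = [tok for line in lines for tok in tokenize(line)]
    # drop the remainder that does not fill a whole block (assumption)
    usable = (len(all_tokens) // block_size) * block_size
    return [all_tokens[i:i + block_size] for i in range(0, usable, block_size)]


lines = ["the dress fits well", "too small", "lovely fabric and great colour"]
by_line = line_by_line(lines, max_length=3)
by_block = concatenate_and_chunk(lines, block_size=4)
print(by_line)
print(by_block)
```

Line-by-line keeps sentence boundaries but produces examples of uneven length, while concatenation produces uniform blocks that may span several lines.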

There’s also a streamed version
([`TextDataLMControllerStreaming`](https://anhquan0412.github.io/that-nlp-library/text_main_lm_streaming.html#textdatalmcontrollerstreaming))

### Language Model Controller: [`ModelLMController`](https://anhquan0412.github.io/that-nlp-library/model_lm_main.html#modellmcontroller)

The library can train a [masked language
model](https://anhquan0412.github.io/that-nlp-library/model_lm_roberta_tutorial.html)
(BERT, RoBERTa …) or a [causal language
model](https://anhquan0412.github.io/that-nlp-library/model_lm_gpt2_tutorial.html)
(GPT), either from scratch or from an existing pretrained language model.
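
As a reminder of how the two objectives differ, the sketch below builds inputs and labels for each from a list of tokens. It is a simplified illustration with hypothetical function names: real masked-language-model training usually also swaps some selected tokens for random ones or leaves them unchanged, and HuggingFace causal models typically perform the one-position shift internally.

```python
import random

MASK = "<mask>"
IGNORE = None  # stands in for the "no loss at this position" label


def causal_lm_example(tokens):
    """Predict each next token: labels are the inputs shifted by one position."""
    return tokens[:-1], tokens[1:]


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; only hidden positions carry a label."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


tokens = "the dress fits well and looks great".split()
print(causal_lm_example(tokens))
print(masked_lm_example(tokens, mask_prob=0.3))
```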

### Hidden States Extraction

The library also allows you to [extract the hidden
states](https://anhquan0412.github.io/that-nlp-library/hidden_states.html)
of your choice for further analysis.

# Documentation

Visit <https://anhquan0412.github.io/that-nlp-library/>

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/anhquan0412/that-nlp-library",
    "name": "that-nlp-library",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "nbdev, python, nlp, natural language processing, transformer, deep learning, envibert, roberta, gpt2, phobert",
    "author": "Quan Tran",
    "author_email": "anhquan0412@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/b7/c2/27b953d5012d5db8aa78724b410b46951f61696fd04b15f60033adebaa53/that-nlp-library-0.2.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "Aim to be a convenient NLP library with the help from HuggingFace",
    "version": "0.2.2",
    "project_urls": {
        "Homepage": "https://github.com/anhquan0412/that-nlp-library"
    },
    "split_keywords": [
        "nbdev",
        " python",
        " nlp",
        " natural language processing",
        " transformer",
        " deep learning",
        " envibert",
        " roberta",
        " gpt2",
        " phobert"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e486cc793f2cf3c38593ce14f430ac679b48f35fa7a3a13a27da31d1364468c0",
                "md5": "5527955d1e220f727b24f482d24b9168",
                "sha256": "b06821a2ccf1d1b816dc4fed65999f14d8f340a7d1398b28d04c12df5d731073"
            },
            "downloads": -1,
            "filename": "that_nlp_library-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5527955d1e220f727b24f482d24b9168",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 60929,
            "upload_time": "2024-05-03T17:43:05",
            "upload_time_iso_8601": "2024-05-03T17:43:05.926441Z",
            "url": "https://files.pythonhosted.org/packages/e4/86/cc793f2cf3c38593ce14f430ac679b48f35fa7a3a13a27da31d1364468c0/that_nlp_library-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b7c227b953d5012d5db8aa78724b410b46951f61696fd04b15f60033adebaa53",
                "md5": "070880e54605e22a67b3d06da5561f91",
                "sha256": "fed2f5b1ab75d6eb4fafca6615428106d4bd3e6d732f9601747acd04b67a1466"
            },
            "downloads": -1,
            "filename": "that-nlp-library-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "070880e54605e22a67b3d06da5561f91",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 47907,
            "upload_time": "2024-05-03T17:43:07",
            "upload_time_iso_8601": "2024-05-03T17:43:07.855211Z",
            "url": "https://files.pythonhosted.org/packages/b7/c2/27b953d5012d5db8aa78724b410b46951f61696fd04b15f60033adebaa53/that-nlp-library-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-03 17:43:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "anhquan0412",
    "github_project": "that-nlp-library",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "that-nlp-library"
}
        