# Welcome to that-nlp-library
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
## Install
``` sh
pip install that_nlp_library
```
It is advised that you manually install torch (with the CUDA version
compatible with your GPU, if you have one). Typically it’s
``` sh
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
```
Visit the [PyTorch page](https://pytorch.org/) for more information.
# High-Level Overview
## Supervised Learning
For supervised learning, the main pipeline contains 2 parts:
### **Text Data Controller: [`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller) (for text processing)**
Here is the list of processing steps you can apply (in order). You can
also skip any step you don’t need.
![](images/text_processings.PNG)
Here is an example of the Text Controller for a classification task
(predict `Division Name`), without any text preprocessing. The code will
also tokenize your text field.
``` python3
# import path assumes the package's standard nbdev module layout
from that_nlp_library.text_main import TextDataController
from transformers import RobertaTokenizer

tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  label_names='Division Name',
                                  sup_types='classification',
                                 )
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer,max_length=100,shuffle_trn=True)
```
And here is an example in which all processing steps are applied:
``` python3
from functools import partial
from underthesea import text_normalize
import nlpaug.augmenter.char as nac

# define the augmentation function
def nlp_aug(x, aug=None):
    results = aug.augment(x)
    # aug.augment always returns a list; unwrap it for single-string input
    if not isinstance(x, list): return results[0]
    return results

aug = nac.KeyboardAug(aug_char_max=3, aug_char_p=0.1, aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug, aug=aug)

# initialize the TextDataController
dset = 'sample_data/Womens_Clothing_Reviews.csv'
tdc = TextDataController.from_csv(dset,
                                  main_text='Review Text',

                                  # metadatas
                                  metadatas='Title',

                                  # label
                                  label_names='Division Name',
                                  sup_types='classification',
                                  # fix a typo in the raw labels
                                  label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates'},

                                  # row filter
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Division Name': lambda x: x is not None,
                                              },

                                  # text transformation
                                  content_transformation=[text_normalize, str.lower],

                                  # validation split
                                  val_ratio=0.2,
                                  stratify_cols=['Division Name'],

                                  # upsampling
                                  upsampling_list=[('Division Name', lambda x: x=='Intimates')],

                                  # text augmentation
                                  content_augmentations=nearby_aug_func
                                 )

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer, max_length=100, shuffle_trn=True)
```
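As a quick sanity check (not part of the library’s API), you can call
the augmentation function above directly; `aug.augment` returns a list,
which the wrapper unwraps for single-string input:

``` python3
# output is random (KeyboardAug swaps characters with keyboard-adjacent
# ones), so your results will differ run to run
print(nearby_aug_func('This dress is absolutely beautiful'))
print(nearby_aug_func(['First review text', 'Second review text']))
```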
For an in-depth tutorial on Text Controller for Supervised Learning
([`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller)),
please visit
[here](https://anhquan0412.github.io/that-nlp-library/text_main.html)
This library also provides a **streamed version of the Text Controller**
([`TextDataControllerStreaming`](https://anhquan0412.github.io/that-nlp-library/text_main_streaming.html#textdatacontrollerstreaming)),
allowing you to work with data without having it entirely on your hard
drive. You can still perform all the processing steps of the
non-streamed version, except for **Train/Validation split** (meaning you
have to define your validation set beforehand) and **Upsampling**.
For more details on **streaming**, visit [how to create a streamed
dataset](https://anhquan0412.github.io/that-nlp-library/text_main_streaming.html)
and [how to train a model with a streamed
dataset](https://anhquan0412.github.io/that-nlp-library/roberta_singlehead_for_streaming)
If you are curious about the time and space efficiency of the streamed
versus the non-streamed version, see the benchmark
[here](https://anhquan0412.github.io/that-nlp-library/text_main_benchmark.html)
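For reference, a streamed dataset in the HuggingFace ecosystem is an
`IterableDataset`; a minimal sketch of creating one (independent of this
library, reusing the same sample CSV) looks like this:

``` python3
from datasets import load_dataset

# streaming=True yields an IterableDataset: rows are read lazily,
# so the full file never needs to be held on disk or in memory at once
dset = load_dataset('csv',
                    data_files='sample_data/Womens_Clothing_Reviews.csv',
                    split='train',
                    streaming=True)
print(next(iter(dset)))  # first row, as a plain dict
```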
### **Model and [`ModelController`](https://anhquan0412.github.io/that-nlp-library/model_main.html#modelcontroller)**
The library can perform the following:
- **Classification ([simple
tutorial](https://anhquan0412.github.io/that-nlp-library/model_classification_tutorial.html))**
- **[Regression](https://anhquan0412.github.io/that-nlp-library/roberta_multihead_regression.html)**
- **[Multilabel
classification](https://anhquan0412.github.io/that-nlp-library/roberta_multilabel.html)**
- **[Multiheads](https://anhquan0412.github.io/that-nlp-library/roberta_multihead.html)**,
  where each head can be either classification or regression
  - “Multihead” means your model predicts multiple outputs at once: for
    example, given a sentence (e.g. a review on an e-commerce site), you
    have to predict the category the sentence is about, its sentiment,
    and perhaps its rating.
  - The above example is a 3-head problem: classification (for
    category), classification (for sentiment), and regression (for the
    rating from 1 to 5).
- For 2-head classification where there’s a hierarchical relationship
  between the first output and the second output (e.g. the first output
  is the level-1 clothing category, and the second output is the level-2
  clothing subcategory), you can utilize two approaches tailored to this
  use case: training with [conditional
  probability](https://anhquan0412.github.io/that-nlp-library/roberta_conditional_prob.html),
  or with [deep hierarchical
  classification](https://anhquan0412.github.io/that-nlp-library/roberta_dhc.html)
  (a sketch of the conditional-probability idea follows this list)
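To make the conditional-probability approach concrete, here is a minimal
sketch of the idea (the head outputs below are made-up numbers, not the
library’s implementation): the joint probability of a (level-1, level-2)
pair factorizes as P(l1) · P(l2 | l1), and the most likely pair is read
off the joint distribution.

``` python3
import numpy as np

# hypothetical softmax outputs of the two heads for one sample
p_l1 = np.array([0.7, 0.3])             # P(level-1 category)
p_l2_given_l1 = np.array([[0.6, 0.4],   # P(level-2 | level-1 = 0)
                          [0.1, 0.9]])  # P(level-2 | level-1 = 1)

# joint distribution over (level-1, level-2) pairs: P(l1) * P(l2|l1)
joint = p_l1[:, None] * p_l2_given_l1   # [[0.42 0.28], [0.03 0.27]]

# most likely (category, subcategory) pair
print(np.unravel_index(joint.argmax(), joint.shape))  # (0, 0)
```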
### Decoupling of Text Controller and Model Controller
In this library, you can use
[`TextDataController`](https://anhquan0412.github.io/that-nlp-library/text_main.html#textdatacontroller)
on its own to handle all the text processing and get back the final
processed HuggingFace `DatasetDict`. Conversely, if you already have
your own processed `DatasetDict`, you can skip the text controller and
use only the
[`ModelController`](https://anhquan0412.github.io/that-nlp-library/model_main.html#modelcontroller)
to train your model. There’s a quick tutorial on this decoupling
[here](https://anhquan0412.github.io/that-nlp-library/model_classification_tutorial.html#train-model-with-only-a-tokenized-datasetdict-no-textdatacontroller)
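For reference, a minimal sketch of building such a tokenized
`DatasetDict` yourself with plain HuggingFace tools (the column names
here are illustrative; see the linked tutorial for the exact format the
model controller expects):

``` python3
from datasets import Dataset, DatasetDict
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def tokenize(batch):
    return tokenizer(batch['text'], max_length=100, truncation=True)

raw = Dataset.from_dict({'text': ['great dress', 'runs small'],
                         'label': [1, 0]})
splits = raw.train_test_split(test_size=0.5)
ddict = DatasetDict({'train': splits['train'],
                     'validation': splits['test']}).map(tokenize, batched=True)
```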
## Language Modeling
For language modeling, the main pipeline also contains 2 parts:
### Text Data Controller for Language Model: [`TextDataLMController`](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html#textdatalmcontroller)
Similarly to `TextDataController`,
[`TextDataLMController`](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html#textdatalmcontroller)
also provides a list of processing steps (except for **Label
Processing**, **Upsampling** and **Text Augmentation**). The controller
also allows tokenization either line-by-line or by token concatenation
(sketched below). Visit the tutorial
[here](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html)
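The token-concatenation option follows the standard language-modeling
recipe: tokenized texts are joined into one long stream and chopped into
fixed-size blocks, so no capacity is wasted on padding. A generic sketch
of the idea (not the controller’s exact internals):

``` python3
def group_texts(examples, block_size=128):
    # flatten each tokenized column (lists of lists) into one long list
    concatenated = {k: sum(examples[k], []) for k in examples}
    # drop the tail so everything divides evenly into block_size chunks
    total = (len(concatenated['input_ids']) // block_size) * block_size
    return {k: [v[i:i + block_size] for i in range(0, total, block_size)]
            for k, v in concatenated.items()}

# typically applied as: tokenized_dataset.map(group_texts, batched=True)
```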
There’s also a streamed version
([`TextDataLMControllerStreaming`](https://anhquan0412.github.io/that-nlp-library/text_main_lm_streaming.html#textdatalmcontrollerstreaming))
### Language Model Controller: [`ModelLMController`](https://anhquan0412.github.io/that-nlp-library/model_lm_main.html#modellmcontroller)
The library can train a [masked language
model](https://anhquan0412.github.io/that-nlp-library/model_lm_roberta_tutorial.html)
(BERT, RoBERTa, …) or a [causal language
model](https://anhquan0412.github.io/that-nlp-library/model_lm_gpt2_tutorial.html)
(GPT), either from scratch or from an existing pretrained language
model.
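In `transformers` terms, the difference between the two objectives
largely comes down to the data collator; a generic sketch (assuming a
`tokenizer` as in the earlier examples, not this library’s internals):

``` python3
from transformers import DataCollatorForLanguageModeling

# masked LM: randomly mask ~15% of tokens and predict them
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                               mlm_probability=0.15)

# causal LM: no masking; labels are a copy of input_ids and the
# model shifts them internally to predict the next token
clm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```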
### Hidden States Extraction
The library also allows you to [extract the hidden
states](https://anhquan0412.github.io/that-nlp-library/hidden_states.html)
of your choice, for further analysis.
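This relies on the standard `transformers` mechanism; as a standalone
sketch (not this library’s API):

``` python3
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

inputs = tokenizer('A review text', return_tensors='pt')
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of
# shape (batch, seq_len, hidden_size); e.g. keep the last layer's
# first-token (<s>) vector as a sentence representation
sent_vec = out.hidden_states[-1][:, 0]
```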
# Documentation
Visit <https://anhquan0412.github.io/that-nlp-library/>