# few-shot-learning-nlp
This library provides tools and utilities for Few Shot Learning in Natural Language Processing (NLP).
## Overview
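Few-shot learning in NLP means training and evaluating models on tasks with only a handful of labeled examples per class. This library implements few-shot approaches for text classification and named entity recognition, along with supporting classification utilities.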
## Installation
You can install this library via pip:
```bash
pip install -U few-shot-learning-nlp
```
## Documentation
The documentation for this library is available [here](https://peulsilva.github.io/few-shot-learning-nlp/).
## Supported Approaches
### Text Classification
- Sentence Transformers Finetuning ([SetFit](https://arxiv.org/abs/2209.11055))
- Pattern Exploiting Training ([PET](https://arxiv.org/abs/2001.07676))
### Named Entity Recognition for Image Documents
- Pattern Exploiting Training ([PET](https://arxiv.org/abs/2001.07676))
- [Bio Technique](https://arxiv.org/abs/2305.04928)
### Classification Utils
- [Focal Loss function for imbalanced datasets](https://arxiv.org/abs/1708.02002)
- Stratified train test split
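
As a rough illustration of why focal loss helps with imbalanced data, the sketch below computes the loss from the linked paper, `FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)`, for a single example in plain Python. The function names and default values here are illustrative assumptions, not this library's API; see the documentation for the actual implementation.

```python
import math


def cross_entropy(probs, target):
    """Standard cross-entropy for one example, given class probabilities."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -math.log(p_t)


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example (hypothetical helper, not the library's API).

    probs  : list of predicted class probabilities
    target : index of the true class
    gamma  : focusing parameter; gamma=0 recovers cross-entropy
    alpha  : constant weighting factor
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct ("easy") prediction and a poor ("hard") one
easy = [0.9, 0.05, 0.05]
hard = [0.2, 0.5, 0.3]

for name, probs in [("easy", easy), ("hard", hard)]:
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

The easy example is scaled by `(1 - 0.9)^2 = 0.01`, while the hard one keeps `(1 - 0.2)^2 = 0.64` of its cross-entropy, so well-classified examples from a dominant class contribute far less to the total loss.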
## Usage
To use this library, import the classes and functions you need and follow the [documentation](https://peulsilva.github.io/few-shot-learning-nlp/) for each component.
Here is a short example of the SetFit implementation:
```python
from datasets import load_dataset
import pandas as pd
from few_shot_learning_nlp.utils import stratified_train_test_split
from torch.utils.data import DataLoader
from few_shot_learning_nlp.few_shot_text_classification.setfit_dataset import SetFitDataset
# Load a dataset for text classification
ag_news_dataset = load_dataset("ag_news")
# Extract necessary information from the dataset
num_classes = len(ag_news_dataset['train'].features['label'].names)
# Perform few-shot learning by selecting a limited number of examples per class
n_shots = 50
train_validation, test_df = stratified_train_test_split(ag_news_dataset['train'], num_shots_per_class=n_shots)
train_df, val_df = stratified_train_test_split(pd.DataFrame(train_validation), num_shots_per_class=30)
# Create SetFitDataset objects for training and validation
set_fit_data_train = SetFitDataset(train_df['text'], train_df['label'], input_example_format=True)
set_fit_data_val = SetFitDataset(val_df['text'], val_df['label'], input_example_format=False)
# Create DataLoader objects for training and validation datasets
train_dataloader = DataLoader(set_fit_data_train.data, shuffle=False)
val_dataloader = DataLoader(set_fit_data_val)
```
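
To make the idea behind `num_shots_per_class` concrete, here is a minimal stand-alone sketch of per-class sampling. It is not the library's `stratified_train_test_split`; the function `sample_n_shots`, its return format, and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_n_shots(texts, labels, num_shots_per_class, seed=0):
    """Pick up to `num_shots_per_class` examples per label (hypothetical helper).

    Returns (few_shot, remainder), each a list of (text, label) tuples.
    """
    rng = random.Random(seed)

    # Group example indices by their label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw the same number of examples from every class
    selected = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(num_shots_per_class, len(indices))
        selected.extend(rng.sample(indices, k))

    # Everything that was not selected goes to the remainder
    selected_set = set(selected)
    rest = [i for i in range(len(labels)) if i not in selected_set]

    few_shot = [(texts[i], labels[i]) for i in selected]
    remainder = [(texts[i], labels[i]) for i in rest]
    return few_shot, remainder


# Toy data: 12 documents spread evenly over 3 classes
texts = [f"doc {i}" for i in range(12)]
labels = [i % 3 for i in range(12)]

few_shot, remainder = sample_n_shots(texts, labels, num_shots_per_class=2)
print(len(few_shot), len(remainder))
```

Sampling per class rather than from the whole dataset keeps every label represented, which matters when only a few dozen examples are available in total.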
### Defining Classifier
```python
import torch

class CLF(torch.nn.Module):
    def __init__(
        self,
        in_features : int,
        out_features : int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)

        self.layer1 = torch.nn.Linear(in_features, 128)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(128, 32)
        self.layer3 = torch.nn.Linear(32, out_features)

    def forward(self, x : torch.Tensor):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        x = self.relu(x)
        return self.layer3(x)
```
### Training the Embedding Model <a name="training-the-embedding-model"></a>
```python
import torch
from sentence_transformers import SentenceTransformer
from few_shot_learning_nlp.few_shot_text_classification.setfit import SetFitTrainer
# Load a pre-trained Sentence Transformer model
model = SentenceTransformer("whaleloops/phrase-bert")
# Initialize the SetFitTrainer with embedding model and classifier
embedding_model = model.to("cuda")
in_features = embedding_model.get_sentence_embedding_dimension()
clf = CLF(in_features, num_classes).to("cuda")
trainer = SetFitTrainer(embedding_model, clf, num_classes)
# Train the embedding model
trainer.train_embedding(train_dataloader, val_dataloader, n_epochs=10)
```
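
In the SetFit paper, the embedding model is fine-tuned contrastively on sentence pairs: pairs from the same class are pulled together and pairs from different classes are pushed apart. The sketch below shows one simple way such pairs could be built from a few-shot set. It illustrates the idea only and makes no claim about how `SetFitDataset` or `SetFitTrainer` construct pairs internally; `build_contrastive_pairs` and the toy data are hypothetical.

```python
import random


def build_contrastive_pairs(texts, labels, pairs_per_example=2, seed=0):
    """Build (anchor, other, target) triples (hypothetical helper).

    target is 1.0 when both sentences share a label, else 0.0.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (text, label) in enumerate(zip(texts, labels)):
        # Candidates with the same label, excluding the anchor itself
        positives = [
            t for j, (t, lab) in enumerate(zip(texts, labels))
            if lab == label and j != i
        ]
        # Candidates with any other label
        negatives = [t for t, lab in zip(texts, labels) if lab != label]
        if not positives or not negatives:
            continue  # cannot form both pair types for this anchor
        for _ in range(pairs_per_example):
            pairs.append((text, rng.choice(positives), 1.0))
            pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


texts = ["stocks fall", "markets rally", "team wins cup", "coach resigns"]
labels = [0, 0, 1, 1]

pairs = build_contrastive_pairs(texts, labels, pairs_per_example=1)
for anchor, other, target in pairs:
    print(f"{target:.0f}  {anchor!r} <-> {other!r}")
```

Because pairs grow roughly quadratically with the number of examples, even a few dozen labeled sentences yield enough training signal to adapt the embedding model.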
### Training the Classifier Model <a name="training-the-classifier-model"></a>
```python
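import numpy as np
# NOTE: the import for shuffle_two_lists is not shown in this README;
# check the library documentation for its location before running this block.

# Count examples per class, then shuffle texts and labels together
_, class_counts = np.unique(train_df['label'], return_counts=True)
X_train_shuffled, y_train_shuffled = shuffle_two_lists(train_df['text'], train_df['label'])

# Train the classifier on top of the fine-tuned embedding model
history, embedding_model, clf = trainer.train_classifier(
    X_train_shuffled, y_train_shuffled, val_df['text'], val_df['label'],
    clf=CLF(in_features, num_classes),
    n_epochs=15,
    lr=1e-4
)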
```
### Testing the Models <a name="testing-the-models"></a>
```python
y_true, y_pred = trainer.test(test_df)
```
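
Once predictions are available, they can be scored with standard metrics. The sketch below computes accuracy and macro-averaged F1 in plain Python. It assumes `y_true` and `y_pred` are flat sequences of integer class labels, which this README does not confirm; if they are tensors, convert them to lists first. The demo labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels standing in for the output of trainer.test
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

print(f"accuracy: {accuracy(y_true_demo, y_pred_demo):.3f}")
print(f"macro F1: {macro_f1(y_true_demo, y_pred_demo):.3f}")
```

Macro F1 weights every class equally, so it is usually more informative than accuracy when the test set is imbalanced.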
Raw data
{
"_id": null,
"home_page": "https://github.com/peulsilva/few-shot-learning-nlp",
"name": "few-shot-learning-nlp",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Pedro Silva",
"author_email": "pedrolmssilva@gmail.com",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "This library provides tools and utilities for Few Shot Learning in Natural Language Processing (NLP).",
"version": "1.0.4",
"project_urls": {
"Homepage": "https://github.com/peulsilva/few-shot-learning-nlp"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f24d9c8ef0fd029fb838eac3c17f78655889d1e67eb3b6d6c860dddf3c952f68",
"md5": "b59a8c9e13cc7ecd2dc9a20ac23be213",
"sha256": "7b523b90123307f0fb64f6b4875d12541b2fb2ec0761f24f99b59c2fc77fd61e"
},
"downloads": -1,
"filename": "few_shot_learning_nlp-1.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b59a8c9e13cc7ecd2dc9a20ac23be213",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 39049,
"upload_time": "2024-05-12T14:51:02",
"upload_time_iso_8601": "2024-05-12T14:51:02.409736Z",
"url": "https://files.pythonhosted.org/packages/f2/4d/9c8ef0fd029fb838eac3c17f78655889d1e67eb3b6d6c860dddf3c952f68/few_shot_learning_nlp-1.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-12 14:51:02",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "peulsilva",
"github_project": "few-shot-learning-nlp",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "few-shot-learning-nlp"
}