setfit

Name: setfit
Version: 1.0.3
Homepage: https://github.com/huggingface/setfit
Summary: Efficient few-shot learning with Sentence Transformers
Upload time: 2024-01-16 17:06:53
Maintainers: Lewis Tunstall, Tom Aarsen
License: Apache 2.0
Keywords: nlp, machine learning, fewshot learning, transformers
            <img src="https://raw.githubusercontent.com/huggingface/setfit/main/assets/setfit.png">

<p align="center">
    🤗 <a href="https://huggingface.co/models?library=setfit" target="_blank">Models</a> | 📊 <a href="https://huggingface.co/setfit" target="_blank">Datasets</a> | 📕 <a href="https://huggingface.co/docs/setfit" target="_blank">Documentation</a> | 📖 <a href="https://huggingface.co/blog/setfit" target="_blank">Blog</a> | 📃 <a href="https://arxiv.org/abs/2209.11055" target="_blank">Paper</a>
</p>

# SetFit - Efficient Few-shot Learning with Sentence Transformers

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of [Sentence Transformers](https://sbert.net/). It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:

* 🗣 **No prompts or verbalizers:** Current techniques for few-shot fine-tuning require handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
* 🏎 **Fast to train:** SetFit doesn't require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
* 🌎 **Multilingual support**: SetFit can be used with any [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint, as shown in the sketch below.
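
For example, swapping in a multilingual body is a one-line change. A minimal sketch, using `paraphrase-multilingual-MiniLM-L12-v2` as just one example of a multilingual checkpoint on the Hub:

```python
from setfit import SetFitModel

# Any multilingual Sentence Transformer from the Hub can serve as the body;
# this checkpoint is one widely used example.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    labels=["negative", "positive"],
)
# Training and inference then proceed exactly as in the English example below.
```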

Check out the [SetFit Documentation](https://huggingface.co/docs/setfit) for more information!

## Installation

Download and install `setfit` by running:

```bash
pip install setfit
```
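
To verify the installation, you can print the package version (a quick sanity check, assuming the package exposes the usual `__version__` attribute):

```bash
python -c "import setfit; print(setfit.__version__)"
# 1.0.3
```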

If you want the bleeding-edge version instead, install from source by running:

```bash
pip install git+https://github.com/huggingface/setfit.git
```

## Usage

The [quickstart](https://huggingface.co/docs/setfit/quickstart) is a good place to learn about training, saving, loading, and performing inference with SetFit models. 

For more examples, check out the [`notebooks`](https://github.com/huggingface/setfit/tree/main/notebooks) directory, the [tutorials](https://huggingface.co/docs/setfit/tutorials/overview), or the [how-to guides](https://huggingface.co/docs/setfit/how_to/overview).


### Training a SetFit model

`setfit` is integrated with the [Hugging Face Hub](https://huggingface.co/) and provides two main classes:

* [`SetFitModel`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitModel): a wrapper that combines a pretrained body from `sentence_transformers` and a classification head from either [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) or [`SetFitHead`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) (a differentiable head built upon `PyTorch` with similar APIs to `sentence_transformers`).
* [`Trainer`](https://huggingface.co/docs/setfit/reference/trainer#setfit.Trainer): a helper class that wraps the fine-tuning process of SetFit.

Here is a simple end-to-end training example using the default classification head from `scikit-learn`:


```python
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset


# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"].select(range(100))
test_dataset = dataset["validation"].select(range(100, len(dataset["validation"])))

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=["negative", "positive"],
)

args = TrainingArguments(
    batch_size=16,
    num_epochs=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"sentence": "text", "label": "label"}  # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate(test_dataset)
print(metrics)
# {'accuracy': 0.8691709844559585}

# Push model to the Hub
trainer.push_to_hub("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")

# Download from Hub
model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")
# Run inference
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)
# ["positive", "negative"]
```
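
The example above uses the default `scikit-learn` head. As a minimal sketch, the differentiable [`SetFitHead`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) can be selected at load time instead, via the `use_differentiable_head` and `head_params` options of `from_pretrained` (with `out_features` matching your number of classes):

```python
from setfit import SetFitModel

# Sketch: swap the scikit-learn logistic-regression head for the
# differentiable, PyTorch-based SetFitHead. out_features must match
# the number of classes (2 for sst2).
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": 2},
)
# The Trainer is then used exactly as in the example above.
```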


## Reproducing the results from the paper

We provide scripts to reproduce the results for SetFit and various baselines presented in Table 2 of our paper. Check out the setup and training instructions in the [`scripts/`](https://github.com/huggingface/setfit/tree/main/scripts) directory.

## Developer installation

To run the code in this project, first create a Python virtual environment, e.g. with Conda:

```bash
conda create -n setfit python=3.9 && conda activate setfit
```

Then install the base requirements with:

```bash
pip install -e '.[dev]'
```

This will install mandatory packages for SetFit like `datasets` as well as development packages like `black` and `isort` that we use to ensure consistent code formatting.

### Formatting your code

We use `black` and `isort` to ensure consistent code formatting. After following the installation steps, you can check your code locally by running:

```bash
make style && make quality
```
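
If `make` is unavailable, a rough equivalent is to invoke the formatters directly (the exact paths and flags live in the Makefile):

```bash
black src tests
isort src tests
```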

## Project structure

```
├── LICENSE
├── Makefile        <- Makefile with commands like `make style` or `make tests`
├── README.md       <- The top-level README for developers using this project.
├── docs            <- Documentation source
├── notebooks       <- Jupyter notebooks.
├── final_results   <- Model predictions from the paper
├── scripts         <- Scripts for training and inference
├── setup.cfg       <- Configuration file to define package metadata
├── setup.py        <- Make this project pip installable with `pip install -e`
├── src             <- Source code for SetFit
└── tests           <- Unit tests
```

## Related work

* [pmbaumgartner/setfit](https://github.com/pmbaumgartner/setfit) - A scikit-learn API version of SetFit.
* [jxpress/setfit-pytorch-lightning](https://github.com/jxpress/setfit-pytorch-lightning) - A PyTorch Lightning implementation of SetFit.
* [davidberenstein1957/spacy-setfit](https://github.com/davidberenstein1957/spacy-setfit) - An easy and intuitive way to use SetFit in combination with spaCy.

## Citation

```bibtex
@misc{https://doi.org/10.48550/arxiv.2209.11055,
  doi = {10.48550/ARXIV.2209.11055},
  url = {https://arxiv.org/abs/2209.11055},
  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```

            
