InstructorEmbedding

Name	InstructorEmbedding JSON
Version	1.0.0 JSON
	download
home_page	https://github.com/HKUNLP/instructor-embedding
Summary	Text embedding tool
upload_time	2023-01-20 21:02:47
maintainer
docs_url	None
author	Hongjin SU
requires_python
license	MIT
keywords	sentence embedding text nlp instructor
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # One Embedder, Any Task: Instruction-Finetuned Text Embeddings

This repository contains the code and pre-trained models for our paper [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741). Please refer to our [project page](https://instructor-embedding.github.io/) for a quick project overview.

We introduce **Instructor**👨‍🏫, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) ***by simply providing the task instruction, without any finetuning***. Instructor👨‍ achieves sota on 70 diverse embedding tasks!

## Get Started
It is very easy to use INSTRUCTOR for any text embeddings. We recommend to first create a virtual environment:
```bash
conda env create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
pip install -r requirements.txt
```
That will create the environment `instructor` we used. To use the embedding tool, first install the `InstructorEmbedding` package from PyPI
```bash
pip install InstructorEmbedding
```
or directly install it from our code
```bash
pip install -e .
```

### Environment setup

Activate the environment by running
```bash
conda activate instructor
```

## Getting Started

First download a pretrained model

```python
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
```

Then provide the sentence and customized instruction to the model.
```python
# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicien sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# postprocess
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)
```

And that's it already. We now have a list of numpy arrays with the emebddings.

```python
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
```

## Use Cases
We provide a few specific use cases in the following. For more examples and applications, refer to [our paper](https://arxiv.org/abs/2212.09741)
### Calculate embeddings for your customized texts
If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions: 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Represent the `domain` `text_type` for `task_objective`:
* `domain` is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.
* `text_type` is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.
* `task_objective` is optional, and it specifies the objective of emebdding, e.g., retrieve a document, classify the sentence, etc.

### Compute similarities between texts
You can use **INSTRUCTOR** to compute similarities between two groups of sentences, with **customized embeddings**.
```python
from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter',0], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.',0]]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics', 0],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday',0]]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)
```

### Use customized embeddings for information retrieval.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant',0]]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.', 0],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loansâ€”and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession",0],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.',0]]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
```

### Use customized embeddings for clustering
```python
import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity', 0],
             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies', 0],
             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model', 0],
             ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron",0],
             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium',0]]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
```
## Training
### Data
We construct Multitask Embeddings Data
with Instructions (MEDI), consisting of a collection of 330 datasets from [Super-NI](https://arxiv.org/abs/2204.07705)(Super-NaturalInstructions), [sentence-transformer embedding training data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), and [KILT](https://arxiv.org/abs/2009.02252), spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:
```
[
    {'query': ['Represent the sentence for retrieving duplicate sentences;', "I see. A faint expression of relief swept over John's face. "], 'pos': ['Represent the sentence for retrieving duplicate sentences;', "John's face changed to a faint sense of relief. "], 'neg': ['Represent the sentence for retrieving duplicate sentences;', "I look of horror fell upon John's face."], 'task_id': 301},
    {'query': ['Represent the sentence for retrieving duplicate sentences;', 'They say things like This is the coolest generation ever.'], 'pos': ['Represent the sentence for retrieving duplicate sentences;', 'These people are speaking in glowing terms about this generation.'], 'neg': ['Represent the sentence for retrieving duplicate sentences;', 'No one is or ever has spoken at all about these people.'], 'task_id': 301},
    ...
    {'query': ['Represent the sentence for retrieving duplicate sentences;', "GAO's congressional policies and protocols apply to all investigative work conducted by the Office of Special Investigations unless an exception is specified herein or noted in advance."], 'pos': ['Represent the sentence for retrieving duplicate sentences;', "GAO's congressional policies and protocols apply to all investigative work conducted by the Office of Special Investigations unless an exception is specified herein or noted in advance."], 'neg': ['Represent the sentence for retrieving duplicate sentences;', 'The GAO always applies to all investigative work conducted by the Office of Special Investigations.'], 'task_id': 301}
]
```
Each instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure data in the same training batch are from the same task.
The MEDI data is available to be downloaded at [this link](https://drive.google.com/file/d/1_Ax8rTeeDKpNxj85LD2Yh7qsh3sYEiSH/view?usp=sharing).

### Train INSTRUCTOR
We provide the example script for training INSTRUCTOR. You may need to first download the [MEDI data](https://drive.google.com/file/d/1_Ax8rTeeDKpNxj85LD2Yh7qsh3sYEiSH/view?usp=sharing), unzip the folder and put `medi-data.json` under `--cache_dir`.
```python
python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.01 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
```
We explain the arguments in the following:
* `--model_name_or_path`: Pretrained checkpoints to start with. We support both model id (e.g., `sentence-transformers/gtr-t5-large`, `sentence-transformers/sentence-t5-large`) or checkpoint path (e.g., checkpoint saved by transformers trainer).
* `--cl_temperature`: Temperature for contrastive loss
* `--cache_dir`: The directory to cache downloaded models and data. The downloaded MEDI data(`medi-data.json`) should be put under the directory `--cache_dir`.
* `--output_dir`: The directory to store the trained models(checkpoints) for evaluation. 

All the other arguments are standard `Huggingface's transformers` training arguments, such as `--overwrite_output_dir`, `--num_train_epochs`, `--learning_rate`. For details, refer to [Huggingface transformers](https://github.com/huggingface/transformers) 

## Evaluation
We evalute INSTRUCTOR massively on 70 diverse tasks, spanning a wide range of tasks and domains. Specifically, we build our evaluation on three benchmarks, [MTEB](https://huggingface.co/spaces/mteb/leaderboard), [Billboard](https://arxiv.org/abs/2112.04139), and [Prompt Retrieval](https://arxiv.org/abs/2209.01975). We explain the details about running evaluation scripts in the following.
<!-- * MTEB is a comprehensive embedding evaluation benchmark that aims to provide a holistic view of embedding models.  It combines several conventional benchmarks (e.g., BEIR and STS) and spans a wide range of domain-specific datasets, including science, biology, and medicine. 
* Prompt Retrieval tasks aim to retrieve a few in-context learning (i.e., demonstration) examples from annotated examples given a test instance. The embedding model is used to encode all annotated examples and to find the few most similar examples to the test instance based on the cosine similarity. We evalute emebddings by measuring the average performance on the downstream tasks. 
* Billboard applies INSTRUCTOR to automatic evaluations for text generation tasks. Following [Kasai et al. (2022a)](https://arxiv.org/abs/2112.04139), we measure the cosine similarity between the generated text and each reference text and take the maximum similarity score over all references available. We evaluate all embedding models by the Pearson correlation with the human judgments. -->

### MTEB
To evaluate the model performance on MTEB benchmark dataset, first install the MTEB library

```python
cd evaluation/MTEB
pip install -e .
```
Then run the following command:
```python
python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results
```
You can evaluate your trained model checkpoints by specifying `--model_name` and run all MTEB datasets by changing `--task_name`. Check [our paper](https://arxiv.org/abs/2212.09741) or [MTEB benchmark](https://huggingface.co/spaces/mteb/leaderboard) for evaluation metrics of all tasks.

### Billboard
To evaluate the model performance on Billboard, run the following command:
```python
cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--model_name` and run all Billboard datasets by changing `--task`. In all of the three datasets in Billboard, we report the Pearson correlation.

### Prompt Retrieval
To evaluate the model performance on Prompt Retrieval, run the following command:
```python
cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt
```
You can evaluate your trained model checkpoints by specifying `--model_name` and run prompt retrieval datasets by changing `--task`. In order to have a consistent metric, we cast all tasks in Prompt Retrieval into a "text-to-text" format, and report the Rouge-L score.

## Bugs or questions?
If you have any question related to the code or the paper, feel free to email Hongjin (`hjsu@cs.hku.hk`) and Weijia (`swj0419@cs.washington.edu`). Please try to specify the problem with details so we can help you better and quicker.

## Citation
If you find our work helpful, please cite us:

```bibtex
@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/HKUNLP/instructor-embedding",
    "name": "InstructorEmbedding",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "sentence,embedding,text,nlp,instructor",
    "author": "Hongjin SU",
    "author_email": "hjsu@cs.hku.hk",
    "download_url": "https://files.pythonhosted.org/packages/a9/ce/32f51c985722a92c45a4e71ecdd200687b7cae860e1d3e3cb4c28f6a5c7e/InstructorEmbedding-1.0.0.tar.gz",
    "platform": null,
    "description": "# One Embedder, Any Task: Instruction-Finetuned Text Embeddings\n\nThis repository contains the code and pre-trained models for our paper [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741). Please refer to our [project page](https://instructor-embedding.github.io/) for a quick project overview.\n\nWe introduce **Instructor**\ud83d\udc68\u200d\ud83c\udfeb, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) ***by simply providing the task instruction, without any finetuning***. Instructor\ud83d\udc68\u200d achieves sota on 70 diverse embedding tasks!\n\n## Get Started\nIt is very easy to use INSTRUCTOR for any text embeddings. We recommend to first create a virtual environment:\n```bash\nconda env create -n instructor python=3.7\ngit clone https://github.com/HKUNLP/instructor-embedding\npip install -r requirements.txt\n```\nThat will create the environment `instructor` we used. To use the embedding tool, first install the `InstructorEmbedding` package from PyPI\n```bash\npip install InstructorEmbedding\n```\nor directly install it from our code\n```bash\npip install -e .\n```\n\n### Environment setup\n\nActivate the environment by running\n```bash\nconda activate instructor\n```\n\n## Getting Started\n\nFirst download a pretrained model\n\n```python\nfrom InstructorEmbedding import INSTRUCTOR\nmodel = INSTRUCTOR('hkunlp/instructor-large')\n```\n\nThen provide the sentence and customized instruction to the model.\n```python\n# prepare texts with instructions\ntext_instruction_pairs = [\n    {\"instruction\": \"Represent the Science title:\", \"text\": \"3D ActionSLAM: wearable person tracking in multi-floor environments\"},\n    {\"instruction\": \"Represent the Medicien sentence for retrieving a duplicate sentence:\", \"text\": \"Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear.\"}\n]\n\n# postprocess\ntexts_with_instructions = []\nfor pair in text_instruction_pairs:\n    texts_with_instructions.append([pair[\"instruction\"], pair[\"text\"]])\n\n# calculate embeddings\ncustomized_embeddings = model.encode(texts_with_instructions)\n```\n\nAnd that's it already. We now have a list of numpy arrays with the emebddings.\n\n```python\nfor pair, embedding in zip(text_instruction_pairs, customized_embeddings):\n    print(\"Instruction: \", pair[\"instruction\"])\n    print(\"text: \", pair[\"text\"])\n    print(\"Embedding: \", embedding)\n    print(\"\")\n```\n\n## Use Cases\nWe provide a few specific use cases in the following. For more examples and applications, refer to [our paper](https://arxiv.org/abs/2212.09741)\n### Calculate embeddings for your customized texts\nIf you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions: \n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Represent the `domain` `text_type` for `task_objective`:\n* `domain` is optional, and it specifies the domain of the text, e.g., science, finance, medicine, etc.\n* `text_type` is required, and it specifies the encoding unit, e.g., sentence, document, paragraph, etc.\n* `task_objective` is optional, and it specifies the objective of emebdding, e.g., retrieve a document, classify the sentence, etc.\n\n### Compute similarities between texts\nYou can use **INSTRUCTOR** to compute similarities between two groups of sentences, with **customized embeddings**.\n```python\nfrom sklearn.metrics.pairwise import cosine_similarity\nsentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter',0], \n               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.',0]]\nsentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics', 0],\n               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday',0]]\nembeddings_a = model.encode(sentences_a)\nembeddings_b = model.encode(sentences_b)\nsimilarities = cosine_similarity(embeddings_a,embeddings_b)\n```\n\n### Use customized embeddings for information retrieval.\n```python\nimport numpy as np\nfrom sklearn.metrics.pairwise import cosine_similarity\nquery  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant',0]]\ncorpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term \"mixed economies\" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.', 0],\n          ['Represent the Wikipedia document for retrieval: ',\"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans\u00e2\u20ac\u201dand some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession\",0],\n          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.',0]]\nquery_embeddings = model.encode(query)\ncorpus_embeddings = model.encode(corpus)\nsimilarities = cosine_similarity(query_embeddings,corpus_embeddings)\nretrieved_doc_id = np.argmax(similarities)\nprint(retrieved_doc_id)\n```\n\n### Use customized embeddings for clustering\n```python\nimport sklearn.cluster\nsentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity', 0],\n             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies', 0],\n             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model', 0],\n             ['Represent the Medicine sentence for clustering: ',\"QCD corrections to Associated t-tbar-H production at the Tevatron\",0],\n             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium',0]]\nembeddings = model.encode(sentences)\nclustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)\nclustering_model.fit(embeddings)\ncluster_assignment = clustering_model.labels_\nprint(cluster_assignment)\n```\n## Training\n### Data\nWe construct Multitask Embeddings Data\nwith Instructions (MEDI), consisting of a collection of 330 datasets from [Super-NI](https://arxiv.org/abs/2204.07705)(Super-NaturalInstructions), [sentence-transformer embedding training data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), and [KILT](https://arxiv.org/abs/2009.02252), spanning a wide range of domains and tasks. We construct positive and negative pairs if they are not provided, and store them in a unified format:\n```\n[\n    {'query': ['Represent the sentence for retrieving duplicate sentences;', \"I see. A faint expression of relief swept over John's face. \"], 'pos': ['Represent the sentence for retrieving duplicate sentences;', \"John's face changed to a faint sense of relief. \"], 'neg': ['Represent the sentence for retrieving duplicate sentences;', \"I look of horror fell upon John's face.\"], 'task_id': 301},\n    {'query': ['Represent the sentence for retrieving duplicate sentences;', 'They say things like This is the coolest generation ever.'], 'pos': ['Represent the sentence for retrieving duplicate sentences;', 'These people are speaking in glowing terms about this generation.'], 'neg': ['Represent the sentence for retrieving duplicate sentences;', 'No one is or ever has spoken at all about these people.'], 'task_id': 301},\n    ...\n    {'query': ['Represent the sentence for retrieving duplicate sentences;', \"GAO's congressional policies and protocols apply to all investigative work conducted by the Office of Special Investigations unless an exception is specified herein or noted in advance.\"], 'pos': ['Represent the sentence for retrieving duplicate sentences;', \"GAO's congressional policies and protocols apply to all investigative work conducted by the Office of Special Investigations unless an exception is specified herein or noted in advance.\"], 'neg': ['Represent the sentence for retrieving duplicate sentences;', 'The GAO always applies to all investigative work conducted by the Office of Special Investigations.'], 'task_id': 301}\n]\n```\nEach instance consists of a query, a positive pair, a negative pair and the id of the task, which is used to ensure data in the same training batch are from the same task.\nThe MEDI data is available to be downloaded at [this link](https://drive.google.com/file/d/1_Ax8rTeeDKpNxj85LD2Yh7qsh3sYEiSH/view?usp=sharing).\n\n### Train INSTRUCTOR\nWe provide the example script for training INSTRUCTOR. You may need to first download the [MEDI data](https://drive.google.com/file/d/1_Ax8rTeeDKpNxj85LD2Yh7qsh3sYEiSH/view?usp=sharing), unzip the folder and put `medi-data.json` under `--cache_dir`.\n```python\npython train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.01 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir\n```\nWe explain the arguments in the following:\n* `--model_name_or_path`: Pretrained checkpoints to start with. We support both model id (e.g., `sentence-transformers/gtr-t5-large`, `sentence-transformers/sentence-t5-large`) or checkpoint path (e.g., checkpoint saved by transformers trainer).\n* `--cl_temperature`: Temperature for contrastive loss\n* `--cache_dir`: The directory to cache downloaded models and data. The downloaded MEDI data(`medi-data.json`) should be put under the directory `--cache_dir`.\n* `--output_dir`: The directory to store the trained models(checkpoints) for evaluation. \n\nAll the other arguments are standard `Huggingface's transformers` training arguments, such as `--overwrite_output_dir`, `--num_train_epochs`, `--learning_rate`. For details, refer to [Huggingface transformers](https://github.com/huggingface/transformers) \n\n## Evaluation\nWe evalute INSTRUCTOR massively on 70 diverse tasks, spanning a wide range of tasks and domains. Specifically, we build our evaluation on three benchmarks, [MTEB](https://huggingface.co/spaces/mteb/leaderboard), [Billboard](https://arxiv.org/abs/2112.04139), and [Prompt Retrieval](https://arxiv.org/abs/2209.01975). We explain the details about running evaluation scripts in the following.\n<!-- * MTEB is a comprehensive embedding evaluation benchmark that aims to provide a holistic view of embedding models.  It combines several conventional benchmarks (e.g., BEIR and STS) and spans a wide range of domain-specific datasets, including science, biology, and medicine. \n* Prompt Retrieval tasks aim to retrieve a few in-context learning (i.e., demonstration) examples from annotated examples given a test instance. The embedding model is used to encode all annotated examples and to find the few most similar examples to the test instance based on the cosine similarity. We evalute emebddings by measuring the average performance on the downstream tasks. \n* Billboard applies INSTRUCTOR to automatic evaluations for text generation tasks. Following [Kasai et al. (2022a)](https://arxiv.org/abs/2112.04139), we measure the cosine similarity between the generated text and each reference text and take the maximum similarity score over all references available. We evaluate all embedding models by the Pearson correlation with the human judgments. -->\n\n### MTEB\nTo evaluate the model performance on MTEB benchmark dataset, first install the MTEB library\n\n```python\ncd evaluation/MTEB\npip install -e .\n```\nThen run the following command:\n```python\npython examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results\n```\nYou can evaluate your trained model checkpoints by specifying `--model_name` and run all MTEB datasets by changing `--task_name`. Check [our paper](https://arxiv.org/abs/2212.09741) or [MTEB benchmark](https://huggingface.co/spaces/mteb/leaderboard) for evaluation metrics of all tasks.\n\n### Billboard\nTo evaluate the model performance on Billboard, run the following command:\n```python\ncd evaluation/text_evaluation\npython main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt\n```\nYou can evaluate your trained model checkpoints by specifying `--model_name` and run all Billboard datasets by changing `--task`. In all of the three datasets in Billboard, we report the Pearson correlation.\n\n### Prompt Retrieval\nTo evaluate the model performance on Prompt Retrieval, run the following command:\n```python\ncd evaluation/prompt_retrieval\npython main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt\n```\nYou can evaluate your trained model checkpoints by specifying `--model_name` and run prompt retrieval datasets by changing `--task`. In order to have a consistent metric, we cast all tasks in Prompt Retrieval into a \"text-to-text\" format, and report the Rouge-L score.\n\n## Bugs or questions?\nIf you have any question related to the code or the paper, feel free to email Hongjin (`hjsu@cs.hku.hk`) and Weijia (`swj0419@cs.washington.edu`). Please try to specify the problem with details so we can help you better and quicker.\n\n## Citation\nIf you find our work helpful, please cite us:\n\n```bibtex\n@inproceedings{INSTRUCTOR,\n  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},\n  author={Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu},\n  url={https://arxiv.org/abs/2212.09741},\n  year={2022},\n}\n```\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Text embedding tool",
    "version": "1.0.0",
    "split_keywords": [
        "sentence",
        "embedding",
        "text",
        "nlp",
        "instructor"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ad4abb1d8203d45c209ef3bd0792c04ef12bc3d9846f9dde608c6ce56d0ba2c2",
                "md5": "27466bcca6b39c2ea776c49c415c7e31",
                "sha256": "f73e0f1fa3649329a70f29fc5abe7d03bac96429c4e2704df8c77d6970b4bc32"
            },
            "downloads": -1,
            "filename": "InstructorEmbedding-1.0.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "27466bcca6b39c2ea776c49c415c7e31",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 13244,
            "upload_time": "2023-01-20T21:02:32",
            "upload_time_iso_8601": "2023-01-20T21:02:32.520365Z",
            "url": "https://files.pythonhosted.org/packages/ad/4a/bb1d8203d45c209ef3bd0792c04ef12bc3d9846f9dde608c6ce56d0ba2c2/InstructorEmbedding-1.0.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a9ce32f51c985722a92c45a4e71ecdd200687b7cae860e1d3e3cb4c28f6a5c7e",
                "md5": "3a644584ae2dc0f37dd1f19a47ad213b",
                "sha256": "d68f51bcfcc98afd556ba40249790eccaf928440f8a002a178fb6bf31858eb02"
            },
            "downloads": -1,
            "filename": "InstructorEmbedding-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3a644584ae2dc0f37dd1f19a47ad213b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 13638,
            "upload_time": "2023-01-20T21:02:47",
            "upload_time_iso_8601": "2023-01-20T21:02:47.901150Z",
            "url": "https://files.pythonhosted.org/packages/a9/ce/32f51c985722a92c45a4e71ecdd200687b7cae860e1d3e3cb4c28f6a5c7e/InstructorEmbedding-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-20 21:02:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "HKUNLP",
    "github_project": "instructor-embedding",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "instructorembedding"
}

Hongjin SU