angle-emb

Name: angle-emb
Version: 0.5.6
Summary: AnglE-optimized Text Embeddings
Author: sean lee
Keywords: angle_emb
Upload time: 2025-01-15 06:14:16
Requirements: No requirements were recorded.
<small>EN | [įŽ€äŊ“中文](README_zh.md)</small>

# AnglE 📐
> <small>Sponsored by <a href="https://www.mixedbread.ai/">Mixedbread</a></small>

**For more detailed usage, please read the 📘 document:** https://angle.readthedocs.io/en/latest/index.html

<a href="https://arxiv.org/abs/2309.12871">
    <img src="https://img.shields.io/badge/Arxiv-2309.12871-yellow.svg?style=flat-square" alt="https://arxiv.org/abs/2309.12871" />
</a>
<a href="https://pypi.org/project/angle_emb/">
    <img src="https://img.shields.io/pypi/v/angle_emb?style=flat-square" alt="PyPI version" />
</a>
<a href="https://pypi.org/project/angle_emb/">
    <img src="https://img.shields.io/pypi/dm/angle_emb?style=flat-square" alt="PyPI Downloads" />
</a>
<a href="https://angle.readthedocs.io/en/latest/index.html">
    <img src="https://readthedocs.org/projects/angle/badge/?version=latest&style=flat-square" alt="Read the docs" />
</a>


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sick-r-1)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sick-r-1?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts16)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts16?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts15)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts15?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts14)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts14?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts13)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts13?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts12)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts12?p=angle-optimized-text-embeddings)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/angle-optimized-text-embeddings/semantic-textual-similarity-on-sts-benchmark)](https://paperswithcode.com/sota/semantic-textual-similarity-on-sts-benchmark?p=angle-optimized-text-embeddings)

đŸ“ĸ **Train/Infer Powerful Sentence Embeddings with AnglE.**
This library is from the paper: [AnglE: Angle-optimized Text Embeddings](https://arxiv.org/abs/2309.12871). It allows training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework, allowing inference with a variety of transformer-based sentence embedding models.

## ✨ Features

**Loss**:
- 📐 AnglE loss
- ⚖ Contrastive loss
- 📏 CoSENT loss
- â˜•ī¸ Espresso loss (previously known as 2DMSE, detail: [README_ESE](README_ESE.md))

**Backbones**:
- BERT-based models (BERT, RoBERTa, ELECTRA, ALBERT, etc.)
- LLM-based models (LLaMA, Mistral, Qwen, etc.)
- Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)

**Training**:
- Single-GPU training
- Multi-GPU training


> <a href="http://makeapullrequest.com"><img src="https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square" alt="http://makeapullrequest.com" /></a> 
    More features will be added in the future. 

## 🏆 Achievements

📅  May 16, 2024 | Paper "[AnglE: Angle-optimized Text Embeddings](https://arxiv.org/abs/2309.12871)" is accepted by ACL 2024 Main Conference.

📅  Mar 13, 2024 | Paper "[BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings](https://arxiv.org/abs/2311.05296)" is accepted by NAACL 2024 Main Conference.


📅  Mar 8, 2024 | 🍞 [mixedbread's embedding](https://www.mixedbread.ai/blog/mxbai-embed-large-v1) ([mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)) achieves SOTA on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) with an average score of **64.68**! The model is trained using AnglE. Congrats mixedbread!


📅  Dec 4, 2023 | Our universal sentence embedding [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) achieves SOTA on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) with an average score of **64.64**! The model is trained using AnglE.


📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark Semantic Textual Similarity task!


## 🤗 Official Pretrained Models

BERT-based models:

|  🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|----|------|------|------|
| [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 512 | cls | English, General-purpose |
| [WhereIsAI/UAE-Code-Large-V1](https://huggingface.co/WhereIsAI/UAE-Code-Large-V1) |  512 | cls | Code Similarity |
| [WhereIsAI/pubmed-angle-base-en](https://huggingface.co/WhereIsAI/pubmed-angle-base-en) |  512 | cls | Medical Similarity |
| [WhereIsAI/pubmed-angle-large-en](https://huggingface.co/WhereIsAI/pubmed-angle-large-en) |  512 | cls | Medical Similarity |

LLM-based models:

| 🤗 HF (lora weight) | Backbone | Max Tokens | Prompts |  Pooling Strategy | Scenario  |
|----|------|------|------|------|------|
| [SeanLee97/angle-llama-13b-nli](https://huggingface.co/SeanLee97/angle-llama-13b-nli) | NousResearch/Llama-2-13b-hf | 4096 | `Prompts.A` | last token | English, Similarity Measurement | 
| [SeanLee97/angle-llama-7b-nli-v2](https://huggingface.co/SeanLee97/angle-llama-7b-nli-v2) | NousResearch/Llama-2-7b-hf | 4096 | `Prompts.A` | last token | English, Similarity Measurement | 


**💡 You can find more third-party embeddings trained with AnglE in the [HuggingFace Collection](https://huggingface.co/collections/SeanLee97/angle-based-embeddings-669a181354729d168a6ead9b).**


## 🚀 Quick Start

### âŦ‡ī¸ Installation

```bash
python -m pip install -U angle-emb
```

### ⌛ Infer BERT-based Model
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QJcA2Mvive4pBxWweTpZz9OgwvE42eJZ?usp=sharing)


1) **With Prompts**: You can specify a prompt with `prompt=YOUR_PROMPT` in the `encode` method. If a prompt is set, the inputs should be a list of dicts or a single dict with the key `text`, where `text` is the placeholder in the prompt for the input text. You can also use other placeholder names (see the sketch after the code block below). We provide a set of predefined prompts in the `Prompts` class; you can check them via `Prompts.list_prompts()`.

```python
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# For retrieval tasks, we use `Prompts.C` as the prompt for the query when using UAE-Large-V1 (no prompt is needed for documents).
# When a prompt is specified, the inputs should be a dict (or a list of dicts) with the key 'text'.
qv = angle.encode({'text': 'what is the weather?'}, to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
```
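
As noted above, you can also use a custom placeholder name; the key in the input dict simply has to match the placeholder in the prompt. A minimal sketch (the prompt text and the `query` placeholder below are illustrative, not one of the predefined prompts):

```python
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Illustrative custom prompt; the dict key 'query' matches the {query} placeholder.
custom_prompt = 'Represent this sentence for searching relevant passages: {query}'
qv = angle.encode({'query': 'what is the weather?'}, to_numpy=True, prompt=custom_prompt)
doc_vecs = angle.encode(['The weather is great!', 'it is rainy today.'], to_numpy=True)

for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
```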

2) **Without Prompts**: There is no need to specify a prompt; just input a list of strings or a single string.

```python
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# For non-retrieval tasks, there is no need to specify a prompt when using UAE-Large-V1.
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
```


### ⌛ Infer LLM-based Models
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QJcA2Mvive4pBxWweTpZz9OgwvE42eJZ?usp=sharing)

If the pretrained weights are LoRA weights, you need to specify the backbone via `model_name_or_path` and the LoRA path via `pretrained_lora_path` in the `from_pretrained` method.

```python
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
                              pooling_strategy='last',
                              is_llm=True,
                              torch_dtype=torch.float16).cuda()
print('All predefined prompts:', Prompts.list_prompts())
doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt=Prompts.A)

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
```


### ⌛ Infer BiLLM-based Models
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QJcA2Mvive4pBxWweTpZz9OgwvE42eJZ?usp=sharing)

Specify `apply_billm` and `billm_model_class` to load and run inference with BiLLM models.


```python
import os
# set an environment variable for billm start index
os.environ['BiLLM_START_INDEX'] = '31'

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# specify `apply_billm` and `billm_model_class` to load billm models
angle = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf',
                              pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
                              pooling_strategy='last',
                              is_llm=True,
                              apply_billm=True,
                              billm_model_class='LlamaForCausalLM',
                              torch_dtype=torch.float16).cuda()

doc_vecs = angle.encode([
    {'text': 'The weather is great!'},
    {'text': 'The weather is very good!'},
    {'text': 'i am going to bed'}
], prompt='The representative word for sentence {text} is:"')

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
```


### ⌛ Infer Espresso/Matryoshka Models
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QJcA2Mvive4pBxWweTpZz9OgwvE42eJZ?usp=sharing)

Specify `layer_index` and `embedding_size` to truncate embeddings.


```python
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity


angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()
# truncate layer
angle = angle.truncate_layer(layer_index=22)
# specify embedding size to truncate embeddings
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
```

### ⌛ Infer Third-party Models

You can load any transformer-based third-party models such as `mixedbread-ai/mxbai-embed-large-v1`, `sentence-transformers/all-MiniLM-L6-v2`, and `BAAI/bge-large-en-v1.5` using `angle_emb`.

Here is an example:

```python
from angle_emb import AnglE

model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()
vec = model.encode('hello world', to_numpy=True)
print(vec)
```

## Batch Inference

It is recommended to use Mixedbread's `batched` library to speed up the inference process.

```bash
python -m pip install batched
```

```python
import batched
from angle_emb import AnglE

model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()
model.encode = batched.dynamically(model.encode, batch_size=64)

vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)
```

## đŸ•¸ī¸ Custom Train

💡 For more details, please refer to the [training and fine-tuning documentation](https://angle.readthedocs.io/en/latest/notes/training.html).


### đŸ—‚ī¸ 1. Data Prepation

We currently support three dataset formats:

1) `DatasetFormats.A`: it is a pair format with three columns: `text1`, `text2`, and `label` (0/1).

2) `DatasetFormats.B`: it is a triple format with three columns: `text`, `positive`, and `negative`. `positive` and `negative` store the positive and negative samples of `text`.

3) `DatasetFormats.C`: it is a pair format with two columns: `text` and `positive`. `positive` stores the positive sample of `text`.

You need to prepare your data as a Hugging Face `datasets.Dataset` in whichever of these formats matches your supervised data; a minimal sketch follows.
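
A minimal sketch of building each format with `datasets.Dataset.from_list` (the sentences below are toy data for illustration):

```python
from datasets import Dataset

# DatasetFormats.A: pairs with a 0/1 label.
ds_a = Dataset.from_list([
    {'text1': 'The weather is great!', 'text2': 'The weather is very good!', 'label': 1},
    {'text1': 'The weather is great!', 'text2': 'i am going to bed', 'label': 0},
])

# DatasetFormats.B: triples of anchor text, positive sample, and negative sample.
ds_b = Dataset.from_list([
    {'text': 'The weather is great!',
     'positive': 'The weather is very good!',
     'negative': 'i am going to bed'},
])

# DatasetFormats.C: pairs of text and positive sample only.
ds_c = Dataset.from_list([
    {'text': 'The weather is great!', 'positive': 'The weather is very good!'},
])
```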

### 🚂 2. Train with CLI [Recommended]

Use `angle-trainer` to train your AnglE model in CLI mode.

1) Single-GPU training:

Usage: 

```bash
CUDA_VISIBLE_DEVICES=0 angle-trainer --help
```

2) Multi-GPU training:

Usage:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 -m angle_emb.angle_trainer --help
```

### 🚂 3. Custom Train

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1h28jHvv_x-0fZ0tItIMjf8rJGp3GcO5V?usp=sharing)


```python
from datasets import load_dataset
from angle_emb import AnglE, AngleDataTokenizer


# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=128, pooling_strategy='cls').cuda()

# 2. load dataset
# `text1`, `text2`, and `label` are three required columns.
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {"text1": str(obj["sentence1"]), "text2": str(obj['sentence2']), "label": obj['score']})
ds = ds.select_columns(["text1", "text2", "label"])

# 3. transform data
train_ds = ds['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)
valid_ds = ds['validation'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=8)

# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=valid_ds,
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'cln_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# 5. evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
```

### 💡 Others

- To enable `llm` training, please specify `--is_llm 1` and configure appropriate LoRA hyperparameters.
- To enable `billm` training, please specify `--apply_billm 1` and configure an appropriate `billm_model_class` such as `LlamaForCausalLM` (refer to: https://github.com/WhereIsAI/BiLLM?tab=readme-ov-file#usage).
- To enable espresso sentence embeddings (ESE), please specify `--apply_ese 1` and configure appropriate ESE hyperparameters via `--ese_kl_temperature float` and `--ese_compression_size integer`.
- To convert the trained AnglE models to `sentence-transformers`, please run `python scripts/convert_to_sentence_transformers.py --help` for more details.


## 💡 4. Fine-tuning Tips

For more details, please refer to the [documentation](https://angle.readthedocs.io/en/latest/notes/training.html#fine-tuning-tips).

1ī¸âƒŖ If your dataset format is `DatasetFormats.A`, it is recommended to slightly increase the weight for `cosine_w` or slightly decrease the weight for `ibn_w`.

2ī¸âƒŖ If your dataset format is `DatasetFormats.B`, it is recommended to set `cosine_w` to 0, and set `angle_w` to a small value like 0.02. Be sure to set `cln_w` and `ibn_w`.

3ī¸âƒŖ If your dataset format is `DatasetFormats.C`, only `ibn_w` and `ibn_tau` are effective. You don't need to tune other parameters.

4ī¸âƒŖ To alleviate information forgetting in fine-tuning, it is better to specify the `teacher_name_or_path`. If the `teacher_name_or_path` equals `model_name_or_path`, it will conduct self-distillation. **It is worth to note that** `teacher_name_or_path` has to have the same tokenizer as `model_name_or_path`. Or it will lead to unexpected results.


## 5. Fine-tuning and Inferring AnglE with `sentence-transformers`

- **Training:** SentenceTransformers also provides an implementation of the [AnglE loss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#angleloss). **However, it is only partially implemented and may not work as well as the official code. We recommend using the official `angle_emb` for fine-tuning AnglE models.**

- **Inference:** If your model is trained with `angle_emb` and you want to use it with `sentence-transformers`, you can convert it to a `sentence-transformers` model using the script `examples/convert_to_sentence_transformers.py`.
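
Once converted, the checkpoint can be loaded like any other `sentence-transformers` model. A minimal sketch, assuming the conversion script wrote a standard SentenceTransformers directory (the path below is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Placeholder path: point this at the directory written by the conversion script.
model = SentenceTransformer('path/to/converted-angle-model')
vecs = model.encode(['The weather is great!', 'i am going to bed'])
print(vecs.shape)
```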


# đŸĢĄ Citation

You are welcome to use our code and pre-trained models. If you do, please support us by citing our work as follows:

```bibtex
@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}
```

# 📜 ChangeLogs

| 📅 | Description |
|----|------|
| 2024 May 21 |  Support Espresso Sentence Embeddings  |
| 2024 Feb 7 |  Support training with only positive pairs (`DatasetFormats.C`)  |
| 2023 Dec 4 |  Release a universal English sentence embedding model: [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1)  |
| 2023 Nov 2 |  Release an English pretrained model: `SeanLee97/angle-llama-13b-nli` |
| 2023 Oct 28 |  Release two Chinese pretrained models: `SeanLee97/angle-roberta-wwm-base-zhnli-v1` and `SeanLee97/angle-llama-7b-zhnli-v1`; add Chinese README.md |

# 📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: xmlee97@gmail.com

# Š License

This project is licensed under the MIT License.
For the pretrained models, please refer to the corresponding license of the models.

            
