[license-image]: https://img.shields.io/badge/License-Apache%202.0-blue.svg
[license-url]: https://opensource.org/licenses/Apache-2.0
[pypi-image]: https://badge.fury.io/py/open-retrievals.svg
[pypi-url]: https://pypi.org/project/open-retrievals
[pepy-image]: https://pepy.tech/badge/open-retrievals/month
[pepy-url]: https://pepy.tech/project/open-retrievals
[build-image]: https://github.com/LongxingTan/open-retrievals/actions/workflows/test.yml/badge.svg?branch=master
[build-url]: https://github.com/LongxingTan/open-retrievals/actions/workflows/test.yml?query=branch%3Amaster
[lint-image]: https://github.com/LongxingTan/open-retrievals/actions/workflows/lint.yml/badge.svg?branch=master
[lint-url]: https://github.com/LongxingTan/open-retrievals/actions/workflows/lint.yml?query=branch%3Amaster
[docs-image]: https://readthedocs.org/projects/open-retrievals/badge/?version=latest
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master
<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="520" align=center/>
</h1><br>
[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]
**[Documentation](https://open-retrievals.readthedocs.io)** | **[Tutorials](https://open-retrievals.readthedocs.io/en/latest/tutorials.html)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)**
**Open-Retrievals** is an easy-to-use Python framework for state-of-the-art text embeddings, oriented toward information retrieval and LLM retrieval-augmented generation (RAG), built on PyTorch and Transformers.
- Contrastive-learning-enhanced embeddings
- LLM embeddings
## Installation
**Prerequisites**
```shell
pip install transformers
pip install faiss-cpu
pip install peft
```
**With pip**
```shell
pip install open-retrievals
```
[//]: # (**With conda**)
[//]: # (```shell)
[//]: # (conda install open-retrievals -c conda-forge)
[//]: # (```)
## Quick-start
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval
# Example list of documents
documents = [
    "Open-retrievals is a text embedding library",
    "I can use it simply with a SOTA RAG application.",
]

# This triggers the model download and initialization
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)

embeddings = model.encode(documents)
len(embeddings)  # 2: one 384-dimensional vector per document
```
## Usage
**Build Index and Retrieval**
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval
sentences = ['A dog is chasing car.', 'A man is playing a guitar.']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)
model.build_index(sentences)
matcher = AutoModelForRetrieval()
results = matcher.faiss_search("He plays guitar.")
```
**Rerank**
```python
from transformers import AutoTokenizer
from retrievals import RerankCollator, RerankModel, RerankTrainer, RerankDataset
train_dataset = RerankDataset(args=data_args)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=False)

model = RerankModel(
    model_args.model_name_or_path,
    pooling_method="mean",
)
# get_optimizer and get_scheduler are helper functions provided by the project
optimizer = get_optimizer(model, lr=5e-5, weight_decay=1e-3)
lr_scheduler = get_scheduler(optimizer, num_train_steps=int(len(train_dataset) / 2 * 1))

trainer = RerankTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=RerankCollator(tokenizer, max_length=data_args.query_max_len),
)
trainer.optimizer = optimizer
trainer.scheduler = lr_scheduler
trainer.train()
```
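The `num_train_steps` expression above appears to encode `dataset_size / batch_size * epochs` with a batch size of 2 and a single epoch. As a sketch (the numbers below are illustrative, not from the project), the general calculation is:

```python
import math

def num_train_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    # one optimizer step per batch, repeated for each epoch
    return math.ceil(num_examples / batch_size) * epochs

print(num_train_steps(1000, 2, 1))  # 500
```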
**RAG with LangChain**
- Prerequisites
```shell
pip install langchain
```
- Server
```python
from retrievals.tools.langchain import LangchainEmbedding, LangchainReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma as Vectorstore
class DenseRetrieval:
    def __init__(self, persist_directory):
        embeddings = LangchainEmbedding(model_name="BAAI/bge-large-zh-v1.5")
        vectordb = Vectorstore(
            persist_directory=persist_directory,
            embedding_function=embeddings,
        )
        self.retriever = vectordb.as_retriever(
            search_type="similarity",
            search_kwargs={"score_threshold": 0.15, "k": 30},
        )

        reranker_args = {
            "model": "../../inputs/bce-reranker-base_v1",
            "top_n": 7,
            "device": "cuda",
            "use_fp16": True,
        }
        self.reranker = LangchainReranker(**reranker_args)
        self.compression_retriever = ContextualCompressionRetriever(
            base_compressor=self.reranker, base_retriever=self.retriever
        )

    def query(self, question: str):
        docs = self.compression_retriever.get_relevant_documents(question)
        return docs
```
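The retrieve-then-rerank flow that `ContextualCompressionRetriever` wires together can be illustrated without any models. This is a toy sketch of the two-stage idea only (the scorers, corpus, and function names are invented for illustration, not the library's implementation):

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score, k=30, top_n=7):
    """Two-stage pipeline: a fast retriever narrows the corpus to k
    candidates, then a slower reranker orders the survivors and keeps top_n."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:top_n]

# toy scorers: word overlap for retrieval, length-normalized overlap for reranking
def overlap(q, d):
    return len(set(q.split()) & set(d.split()))

def weighted_overlap(q, d):
    return overlap(q, d) / (len(d.split()) or 1)

corpus = [
    "a man is playing a guitar",
    "a dog is chasing a car",
    "he plays guitar on stage",
]
print(retrieve_then_rerank("he plays guitar", corpus, overlap, weighted_overlap, k=2, top_n=1))
# ['he plays guitar on stage']
```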
[//]: # (**RAG with LLamaIndex**)
[//]: # ()
[//]: # (```shell)
[//]: # (pip install llamaindex)
[//]: # (```)
[//]: # ()
[//]: # ()
[//]: # (```python)
[//]: # ()
[//]: # ()
[//]: # (```)
**Use pretrained sentence embeddings**
```python
from retrievals import AutoModelForEmbedding
sentences = ["Hello world", "How are you?"]
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path, pooling_method="mean", normalize_embeddings=True)
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
print(sentence_embeddings)
```
**Fine-tune transformers with contrastive learning**
```python
from transformers import AutoTokenizer
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval, RetrievalTrainer, PairCollator, TripletCollator
from retrievals.losses import ArcFaceAdaptiveMarginLoss, InfoNCE, SimCSE, TripletLoss
from retrievals.data import RetrievalDataset, RerankDataset
train_dataset = RetrievalDataset(args=data_args)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=False)

model = AutoModelForEmbedding(
    model_args.model_name_or_path,
    pooling_method="cls",
)
# get_optimizer and get_scheduler are helper functions provided by the project
optimizer = get_optimizer(model, lr=5e-5, weight_decay=1e-3)
lr_scheduler = get_scheduler(optimizer, num_train_steps=int(len(train_dataset) / 2 * 1))

trainer = RetrievalTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=TripletCollator(tokenizer, max_length=data_args.query_max_len),
    loss_fn=TripletLoss(),
)
trainer.optimizer = optimizer
trainer.scheduler = lr_scheduler
trainer.train()
```
**Fine-tune an LLM for embedding with contrastive learning**
```python
from retrievals import AutoModelForEmbedding

model = AutoModelForEmbedding(
    "mistralai/Mistral-7B-v0.1",
    pooling_method='cls',
    query_instruction='Instruct: Retrieve semantically similar text\nQuery: ',
)
```
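The `query_instruction` string is a prefix prepended to queries (but typically not to passages) before encoding, in the style of instruction-tuned embedders. A minimal sketch of that convention (the function name is illustrative, not the library's API):

```python
def with_instruction(texts, instruction):
    # queries get the instruction prefix; passages are usually encoded as-is
    return [instruction + t for t in texts]

instruction = 'Instruct: Retrieve semantically similar text\nQuery: '
queries = with_instruction(["find similar sentences"], instruction)
print(queries[0])
```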
**Search by Cosine similarity/KNN**
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval
query_texts = ['A dog is chasing car.']
passage_texts = ['A man is playing a guitar.', 'A bee is flying low']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)
query_embeddings = model.encode(query_texts, convert_to_tensor=True)
passage_embeddings = model.encode(passage_texts, convert_to_tensor=True)
matcher = AutoModelForRetrieval(method='cosine')
dists, indices = matcher.similarity_search(query_embeddings, passage_embeddings, top_k=1)
```
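Under the hood, cosine-similarity search is a ranking of normalized dot products between the query and every passage. A dependency-free sketch of the kind of `(dists, indices)` result returned above (the function name and shapes are assumptions for illustration):

```python
import math

def top_k_cosine(query_vec, passage_vecs, top_k=1):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # score every passage, then keep the top_k highest-scoring ones
    scored = sorted(
        ((cos(query_vec, p), i) for i, p in enumerate(passage_vecs)),
        reverse=True,
    )
    dists = [score for score, _ in scored[:top_k]]
    indices = [idx for _, idx in scored[:top_k]]
    return dists, indices

dists, indices = top_k_cosine([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1]], top_k=1)
# the second passage is nearly parallel to the query, so indices == [1]
```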
## References & Acknowledgements
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)
- [uniem](https://github.com/wangyuxinwhy/uniem)
- [BCEmbedding](https://github.com/netease-youdao/BCEmbedding)