open-retrievals

- **Name**: open-retrievals
- **Version**: 0.0.2
- **Home page**: https://github.com/LongxingTan/open-retrievals
- **Summary**: Text Embeddings for Retrieval and RAG based on transformers
- **Upload time**: 2024-04-15 11:13:51
- **Author**: Longxing Tan
- **Requires Python**: >=3.8.0
- **License**: Apache 2.0 License
- **Keywords**: nlp, retrieval, rag, rerank, text embedding, contrastive
- **Requirements**: torch, transformers, accelerator, peft, datasets, faiss-cpu, scikit-learn, tqdm
            [license-image]: https://img.shields.io/badge/License-Apache%202.0-blue.svg
[license-url]: https://opensource.org/licenses/Apache-2.0
[pypi-image]: https://badge.fury.io/py/open-retrievals.svg
[pypi-url]: https://pypi.org/project/open-retrievals
[pepy-image]: https://pepy.tech/badge/retrievals/month
[pepy-url]: https://pepy.tech/project/retrievals
[build-image]: https://github.com/LongxingTan/open-retrievals/actions/workflows/test.yml/badge.svg?branch=master
[build-url]: https://github.com/LongxingTan/open-retrievals/actions/workflows/test.yml?query=branch%3Amaster
[lint-image]: https://github.com/LongxingTan/open-retrievals/actions/workflows/lint.yml/badge.svg?branch=master
[lint-url]: https://github.com/LongxingTan/open-retrievals/actions/workflows/lint.yml?query=branch%3Amaster
[docs-image]: https://readthedocs.org/projects/open-retrievals/badge/?version=latest
[docs-url]: https://open-retrievals.readthedocs.io/en/latest/?version=latest
[coverage-image]: https://codecov.io/gh/longxingtan/open-retrievals/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/longxingtan/open-retrievals?branch=master

<h1 align="center">
<img src="./docs/source/_static/logo.svg" width="520" align=center/>
</h1><br>

[![LICENSE][license-image]][license-url]
[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Lint Status][lint-image]][lint-url]
[![Docs Status][docs-image]][docs-url]
[![Code Coverage][coverage-image]][coverage-url]


**[Documentation](https://open-retrievals.readthedocs.io)** | **[Tutorials](https://open-retrievals.readthedocs.io/en/latest/tutorials.html)** | **[中文](https://github.com/LongxingTan/open-retrievals/blob/master/README_zh-CN.md)**

**Open-Retrievals** is an easy-to-use Python framework for obtaining SOTA text embeddings, oriented to information retrieval and LLM retrieval-augmented generation (RAG), built on PyTorch and Transformers.
- Contrastive learning enhanced embeddings
- LLM embeddings


## Installation

**Prerequisites**
```shell
pip install transformers
pip install faiss-cpu
pip install peft
```

**With pip**
```shell
pip install open-retrievals
```
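
To verify the install, import the top-level module (note it is `retrievals`, not `open_retrievals`):

```shell
python -c "import retrievals"
```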

[//]: # (**With conda**)

[//]: # (```shell)

[//]: # (conda install open-retrievals -c conda-forge)

[//]: # (```)

## Quick-start

```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval

# Example list of documents
documents = [
    "Open-retrievals is a text embedding libraries",
    "I can use it simply with a SOTA RAG application.",
]

# This will trigger the model download and initialization
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)

embeddings = model.encode(documents)
len(embeddings)  # 2: one embedding per document, each a 384-dimensional vector
```
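
As a quick sanity check, the two document embeddings can be compared with cosine similarity in plain NumPy (a minimal sketch, assuming `encode` returns array-like vectors here since `convert_to_tensor` is not set):

```python
import numpy as np

# Cosine similarity between the two document embeddings.
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {score:.4f}")
```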


## Usage

**Build an index and retrieve**
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval

sentences = ['A dog is chasing a car.', 'A man is playing a guitar.']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)
model.build_index(sentences)

matcher = AutoModelForRetrieval()
results = matcher.faiss_search("He plays guitar.")
```
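
For comparison, roughly the same search can be done against the `faiss-cpu` dependency directly; the sketch below is illustrative, and the flat inner-product index with L2 normalization is an assumption, not necessarily what `build_index` does internally:

```python
import faiss
import numpy as np

corpus = np.asarray(model.encode(sentences), dtype="float32")
faiss.normalize_L2(corpus)                   # normalize so inner product == cosine
index = faiss.IndexFlatIP(corpus.shape[1])   # exact (flat) inner-product index
index.add(corpus)

query = np.asarray(model.encode(["He plays guitar."]), dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)         # top-1 neighbour per query
print(sentences[ids[0][0]], scores[0][0])
```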

**Rerank**
```python
from transformers import AutoTokenizer
from retrievals import RerankCollator, RerankModel, RerankTrainer, RerankDataset

# `data_args`, `model_args`, and `training_args` are HfArgumentParser-style
# argument objects assumed to be defined elsewhere.
train_dataset = RerankDataset(args=data_args)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=False)

model = RerankModel(
    model_args.model_name_or_path,
    pooling_method="mean"
)
# `get_optimizer` / `get_scheduler` are helper functions; see the sketch below.
optimizer = get_optimizer(model, lr=5e-5, weight_decay=1e-3)

# num_train_steps = len(dataset) / batch_size * epochs, assuming batch size 2 and 1 epoch
lr_scheduler = get_scheduler(optimizer, num_train_steps=int(len(train_dataset) / 2 * 1))

trainer = RerankTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=RerankCollator(tokenizer, max_length=data_args.query_max_len),
)
trainer.optimizer = optimizer
trainer.scheduler = lr_scheduler
trainer.train()
```
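
`get_optimizer` and `get_scheduler` are not imported in the snippet above; a minimal stand-in built from plain PyTorch and the `transformers` scheduler helper could look like this (names and defaults are illustrative, not the library's actual implementation):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def get_optimizer(model, lr=5e-5, weight_decay=1e-3):
    # AdamW over all trainable parameters.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

def get_scheduler(optimizer, num_train_steps, warmup_ratio=0.05):
    # Linear decay with a short warmup phase.
    return get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(num_train_steps * warmup_ratio),
        num_training_steps=num_train_steps,
    )
```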

**RAG with LangChain**


- Prerequisites

```shell
pip install langchain
```


- Server

```python
from retrievals.tools.langchain import LangchainEmbedding, LangchainReranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma as Vectorstore


class DenseRetrieval:
    def __init__(self, persist_directory):
        embeddings = LangchainEmbedding(model_name="BAAI/bge-large-zh-v1.5")
        vectordb = Vectorstore(
            persist_directory=persist_directory,
            embedding_function=embeddings,
        )
        retrieval_args = {"search_type" :"similarity", "score_threshold": 0.15, "k": 30}
        self.retriever = vectordb.as_retriever(retrieval_args)

        reranker_args = {
            "model": "../../inputs/bce-reranker-base_v1",
            "top_n": 7,
            "device": "cuda",
            "use_fp16": True,
        }
        self.reranker = LangchainReranker(**reranker_args)
        self.compression_retriever = ContextualCompressionRetriever(
            base_compressor=self.reranker, base_retriever=self.retriever
        )

    def query(self, question: str):
        docs = self.compression_retriever.get_relevant_documents(question)
        return docs
```
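
A caller then instantiates the class with a Chroma persist directory and queries it (the directory path is a placeholder):

```python
retrieval = DenseRetrieval(persist_directory="./chroma_db")
docs = retrieval.query("How is the reranker combined with the retriever?")
for doc in docs:
    print(doc.page_content[:80])
```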

[//]: # (**RAG with LLamaIndex**)

[//]: # ()
[//]: # (```shell)

[//]: # (pip install llamaindex)

[//]: # (```)

[//]: # ()
[//]: # ()
[//]: # (```python)

[//]: # ()
[//]: # ()
[//]: # (```)

**Use pretrained sentence embeddings**
```python
from retrievals import AutoModelForEmbedding

sentences = ["Hello world", "How are you?"]
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path, pooling_method="mean", normalize_embeddings=True)
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
print(sentence_embeddings)
```
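
Since `normalize_embeddings=True` L2-normalizes the outputs and `convert_to_tensor=True` returns a tensor, the pairwise cosine-similarity matrix is a single matrix product:

```python
# (2, 2) cosine-similarity matrix; the diagonal is 1.0 for normalized vectors.
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)
```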


**Fine-tune transformers with contrastive learning**
```python
from transformers import AutoTokenizer
from retrievals import AutoModelForEmbedding, RetrievalTrainer, PairCollator, TripletCollator
from retrievals.losses import ArcFaceAdaptiveMarginLoss, InfoNCE, SimCSE, TripletLoss
from retrievals.data import RetrievalDataset

# `data_args`, `model_args`, and `training_args` are HfArgumentParser-style
# argument objects assumed to be defined elsewhere.
train_dataset = RetrievalDataset(args=data_args)
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=False)

model = AutoModelForEmbedding(
    model_args.model_name_or_path,
    pooling_method="cls"
)
# `get_optimizer` / `get_scheduler`: see the helper sketch in the Rerank section.
optimizer = get_optimizer(model, lr=5e-5, weight_decay=1e-3)

# num_train_steps = len(dataset) / batch_size * epochs, assuming batch size 2 and 1 epoch
lr_scheduler = get_scheduler(optimizer, num_train_steps=int(len(train_dataset) / 2 * 1))

trainer = RetrievalTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=TripletCollator(tokenizer, max_length=data_args.query_max_len),
    loss_fn=TripletLoss(),
)
trainer.optimizer = optimizer
trainer.scheduler = lr_scheduler
trainer.train()
```
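
For intuition, a triplet loss of the kind imported above pulls each anchor toward its positive and away from its negative by at least a margin; a minimal sketch (the margin and the cosine-distance choice are assumptions, not necessarily `TripletLoss`'s defaults):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Cosine distance = 1 - cosine similarity, computed row-wise.
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    # Penalize triplets where the positive is not closer than the negative by `margin`.
    return torch.relu(d_pos - d_neg + margin).mean()
```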

**Fine-tune an LLM for embeddings with contrastive learning**
```python
from retrievals import AutoModelForEmbedding

model = AutoModelForEmbedding(
    "mistralai/Mistral-7B-v0.1",
    pooling_method='cls',
    query_instruction='Instruct: Retrieve semantically similar text\nQuery: '
)
```
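
Encoding then works as in the earlier examples; the `query_instruction` prefix is presumably prepended to each query before tokenization:

```python
queries = ["How do I fine-tune an embedding model?"]
query_embeddings = model.encode(queries, convert_to_tensor=True)
```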

**Search by cosine similarity / KNN**
```python
from retrievals import AutoModelForEmbedding, AutoModelForRetrieval

query_texts = ['A dog is chasing a car.']
passage_texts = ['A man is playing a guitar.', 'A bee is flying low']
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModelForEmbedding(model_name_or_path)
query_embeddings = model.encode(query_texts, convert_to_tensor=True)
passage_embeddings = model.encode(passage_texts, convert_to_tensor=True)

matcher = AutoModelForRetrieval(method='cosine')
dists, indices = matcher.similarity_search(query_embeddings, passage_embeddings, top_k=1)
```
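
The same top-1 search can be reproduced with plain PyTorch, which makes the returned `(dists, indices)` pair concrete (a sketch of the computation, not the library's internals):

```python
import torch.nn.functional as F

# Normalize, then a matmul yields the (n_queries, n_passages) cosine matrix.
q = F.normalize(query_embeddings, dim=-1)
p = F.normalize(passage_embeddings, dim=-1)
scores = q @ p.T
dists, indices = scores.topk(k=1, dim=-1)  # best passage per query
```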

## References & Acknowledgements
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)
- [uniem](https://github.com/wangyuxinwhy/uniem)
- [BCEmbedding](https://github.com/netease-youdao/BCEmbedding)



            
