retriv


- Name: retriv
- Version: 0.2.3
- Home page: https://github.com/AmenRa/retriv
- Summary: retriv: A Python Search Engine for Humans.
- Upload time: 2023-08-24 08:56:45
- Author: Elias Bassani
- Requires Python: >=3.8
- Keywords: information retrieval, search engine, bm25, numba, sparse retrieval, dense retrieval, hybrid retrieval, neural information retrieval

<div align="center">
  <img src="https://repository-images.githubusercontent.com/566840861/ce7eeed0-7454-4aff-9073-235a83eeb6e7">
</div>

<p align="center">
  <!-- Python -->
  <a href="https://www.python.org" alt="Python">
      <img src="https://badges.aleen42.com/src/python.svg" />
  </a>
  <!-- Version -->
  <a href="https://badge.fury.io/py/retriv"><img src="https://badge.fury.io/py/retriv.svg" alt="PyPI version" height="18"></a>
  <!-- Docs -->
  <!-- <a href="https://amenra.github.io/retriv"><img src="https://img.shields.io/badge/docs-passing-<COLOR>.svg" alt="Documentation Status"></a> -->
  <!-- Black -->
  <a href="https://github.com/psf/black" alt="Code style: black">
      <img src="https://img.shields.io/badge/code%20style-black-000000.svg" />
  </a>
  <!-- License -->
  <a href="https://lbesson.mit-license.org/"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT"></a>
  <!-- Google Colab -->
  <!-- <a href="https://colab.research.google.com/github/AmenRa/retriv/blob/master/notebooks/1_overview.ipynb"> -->
      <!-- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> -->
  <!-- </a> -->
</p>

## 🔥 News
- [August 23, 2023] `retriv` 0.2.2 is out!  
This release adds _experimental_ support for multi-field documents and filters.
Please refer to the [Advanced Retriever](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md) documentation.

- [February 18, 2023] `retriv` 0.2.0 is out!  
This release adds support for Dense and Hybrid Retrieval.
Dense Retrieval leverages the semantic similarity of the queries' and documents' vector representations, which can be computed directly by `retriv` or imported from other sources.
Hybrid Retrieval mixes the results of traditional retrieval, informally called Sparse Retrieval, and Dense Retrieval to further improve retrieval effectiveness.
As the library was almost completely redone, indices built with previous versions are no longer supported.

## ⚡️ Introduction

[retriv](https://github.com/AmenRa/retriv) is a user-friendly and efficient [search engine](https://en.wikipedia.org/wiki/Search_engine) implemented in [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) supporting Sparse (traditional search with [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)), Dense ([semantic search](https://en.wikipedia.org/wiki/Semantic_search)) and Hybrid retrieval (a mix of Sparse and Dense Retrieval).
It allows you to build a search engine in a __single line of code__.

[retriv](https://github.com/AmenRa/retriv) is built upon [Numba](https://github.com/numba/numba) for high-speed [vector operations](https://en.wikipedia.org/wiki/Automatic_vectorization) and [automatic parallelization](https://en.wikipedia.org/wiki/Automatic_parallelization), [PyTorch](https://pytorch.org) and [Transformers](https://huggingface.co/docs/transformers/index) for easy access and usage of [Transformer-based Language Models](https://web.stanford.edu/~jurafsky/slp3/10.pdf), and [Faiss](https://github.com/facebookresearch/faiss) for approximate [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search).
In addition, it provides automatic tuning functionalities that let you tune its internal components with minimal intervention.


## ✨ Main Features

### Retrievers
- [Sparse Retriever](https://github.com/AmenRa/retriv/blob/main/docs/sparse_retriever.md): standard searcher based on lexical matching. 
[retriv](https://github.com/AmenRa/retriv) implements [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) as its main retrieval model.
[TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) is also supported for educational purposes.
The sparse retriever comes armed with multiple [stemmers](https://en.wikipedia.org/wiki/Stemming), [tokenizers](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization), and [stop-word](https://en.wikipedia.org/wiki/Stop_word) lists, for multiple languages.
Click [here](https://github.com/AmenRa/retriv/blob/main/docs/sparse_retriever.md) to learn more.
- [Dense Retriever](https://github.com/AmenRa/retriv/blob/main/docs/dense_retriever.md): a dense retriever is a retrieval model that performs [semantic search](https://en.wikipedia.org/wiki/Semantic_search). 
Click [here](https://github.com/AmenRa/retriv/blob/main/docs/dense_retriever.md) to learn more.
- [Hybrid Retriever](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md): a hybrid retriever is a retrieval model built on top of a sparse and a dense retriever.
Click [here](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md) to learn more.
- [Advanced Retriever](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md): an advanced sparse retriever supporting filters. This is an experimental feature.
Click [here](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md) to learn more.
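
As a rough illustration of what semantic search means for the Dense Retriever, here is a minimal brute-force sketch that ranks documents by cosine similarity between vectors. It is not `retriv`'s implementation: the vectors are made-up toy values (in practice they would come from a language model), and the names `cosine` and `dense_search` are hypothetical helpers introduced only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_search(query_vector, doc_vectors, cutoff=2):
    """Score every document against the query and keep the top `cutoff`."""
    scored = [
        {"id": doc_id, "score": cosine(query_vector, vector)}
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:cutoff]


# Toy 3-dimensional embeddings (assumed values, for illustration only).
doc_vectors = {
    "doc_1": [0.9, 0.1, 0.0],
    "doc_2": [0.1, 0.9, 0.2],
    "doc_3": [0.0, 0.2, 0.9],
}
query_vector = [0.2, 0.8, 0.1]

top_docs = dense_search(query_vector, doc_vectors, cutoff=2)
print([doc["id"] for doc in top_docs])
```

A real system would replace the exhaustive loop with an approximate nearest neighbor index, which is the role Faiss plays in `retriv`.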

### Unified Search Interface
All the supported retrievers share the same search interface:
- [search](#search): standard search functionality, what you expect from a search engine.
- [msearch](#multi-search): computes the results for multiple queries at once.
It leverages [automatic parallelization](https://en.wikipedia.org/wiki/Automatic_parallelization) whenever possible.
- [bsearch](#batch-search): similar to [msearch](#multi-search) but automatically generates batches of queries to evaluate and allows dynamic writing of the search results to disk in [JSONL](https://jsonlines.org) format. [bsearch](#batch-search) is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of [Neural Models](https://en.wikipedia.org/wiki/Artificial_neural_network) for [Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval).
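
The batching idea behind bsearch can be sketched in plain Python as follows. This is only an illustration of the pattern (process queries in fixed-size batches and append each result to a JSONL file so that nothing accumulates in memory), not `retriv`'s code. The names `batched`, `batch_search_to_jsonl`, and `toy_search_many` are hypothetical, and the toy search function returns made-up scores.

```python
import json
import os
import tempfile


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_search_to_jsonl(search_many, queries, path, batch_size=2):
    """Run `search_many` on each batch and write one JSON object per query."""
    n_written = 0
    with open(path, "w", encoding="utf-8") as f:
        for batch in batched(queries, batch_size):
            results = search_many(batch)  # {query_id: {doc_id: score}}
            for query_id, doc_scores in results.items():
                f.write(json.dumps({"id": query_id, "results": doc_scores}) + "\n")
                n_written += 1
    return n_written


def toy_search_many(batch):
    """Stand-in for a real multi-query search; the scores are placeholders."""
    return {q["id"]: {"doc_1": float(len(q["text"]))} for q in batch}


queries = [{"id": f"q_{i}", "text": "witches masses"} for i in range(5)]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
n_lines = batch_search_to_jsonl(toy_search_many, queries, out_path, batch_size=2)
print(n_lines, "lines written to", out_path)
```

Because each batch is written as soon as it is computed, peak memory depends on the batch size rather than on the total number of queries.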

### AutoTune
[retriv](https://github.com/AmenRa/retriv) automatically tunes [Faiss](https://github.com/facebookresearch/faiss) configuration for approximate nearest neighbors search by leveraging [AutoFaiss](https://github.com/criteo/autofaiss) to guarantee 10ms response time based on your available hardware.
Moreover, it offers an automatic tuning functionality for [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)'s parameters, which requires minimal user intervention.
Under the hood, [retriv](https://github.com/AmenRa/retriv) leverages [Optuna](https://optuna.org), a [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization) framework, and [ranx](https://github.com/AmenRa/ranx), an [Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval) evaluation library, to test several parameter configurations for [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) and choose the best one.
Finally, it can automatically balance the importance of lexical and semantic relevance scores computed by the [Hybrid Retriever](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md) to maximize retrieval effectiveness.
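
To make the idea of balancing lexical and semantic scores concrete, here is a minimal sketch of one common fusion scheme: min-max normalize each score list, then take a weighted sum. This is an assumption chosen for illustration; the fusion method `retriv` actually uses may differ. The function names and the `sparse_weight` parameter are hypothetical, and the scores are toy values.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(sparse_scores, dense_scores, sparse_weight=0.5):
    """Weighted sum of normalized scores; missing documents count as 0."""
    sparse = min_max_normalize(sparse_scores)
    dense = min_max_normalize(dense_scores)
    fused = {}
    for doc_id in set(sparse) | set(dense):
        fused[doc_id] = (
            sparse_weight * sparse.get(doc_id, 0.0)
            + (1 - sparse_weight) * dense.get(doc_id, 0.0)
        )
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy scores (assumed values, for illustration only).
sparse_scores = {"doc_1": 0.69, "doc_2": 1.75}
dense_scores = {"doc_1": 0.20, "doc_2": 0.80, "doc_3": 0.90}

ranking = fuse(sparse_scores, dense_scores, sparse_weight=0.5)
print([doc_id for doc_id, _ in ranking])
```

Automatic tuning, in this picture, amounts to searching for the `sparse_weight` that maximizes an evaluation metric on queries with known relevant documents.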

## 📚 Documentation

- [Sparse Retriever](https://github.com/AmenRa/retriv/blob/main/docs/sparse_retriever.md)
- [Dense Retriever](https://github.com/AmenRa/retriv/blob/main/docs/dense_retriever.md)
- [Hybrid Retriever](https://github.com/AmenRa/retriv/blob/main/docs/hybrid_retriever.md)
- [Advanced Retriever](https://github.com/AmenRa/retriv/blob/main/docs/advanced_retriever.md)
- [Text Pre-Processing](https://github.com/AmenRa/retriv/blob/main/docs/text_preprocessing.md)
- [FAQ](https://github.com/AmenRa/retriv/blob/main/docs/faq.md)

## 🔌 Requirements
```
python>=3.8
```

## 💾 Installation
```bash
pip install retriv
```

## 💡 Minimal Working Example

```python
# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")
```
Output:
```json
[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
```
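
To give a sense of where scores like these come from, here is a self-contained BM25 sketch over the same collection. It is not `retriv`'s implementation: it assumes plain lowercase whitespace tokenization (no stemming or stop-word removal), the common BM25 formula with `k1 = 1.2` and `b = 0.75`, and a hypothetical helper named `bm25_search`.

```python
import math
from collections import Counter

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumption of this sketch)."""
    return text.lower().split()


def bm25_search(query, docs, k1=1.2, b=0.75):
    """Rank `docs` for `query` with a textbook BM25 scoring function."""
    tokenized = {d["id"]: tokenize(d["text"]) for d in docs}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    results = []
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            results.append({"id": doc_id, "score": score})
    return sorted(results, key=lambda r: r["score"], reverse=True)


results = bm25_search("witches masses", collection)
for r in results:
    print(r["id"], round(r["score"], 4))
```

Under these assumptions the sketch ranks `doc_2` above `doc_1`, with scores close to those shown in the output above. `doc_2` wins because it matches both query terms, and "witches" is rarer in the collection than "masses", so it carries a higher weight.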






## 🎁 Feature Requests
Would you like to see other features implemented? Please open a [feature request](https://github.com/AmenRa/retriv/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=%5BFeature+Request%5D+title).


## 🤘 Want to contribute?
Would you like to contribute? Please drop me an [e-mail](mailto:elias.bssn@gmail.com?subject=[GitHub]%20retriv).


## 📄 License
[retriv](https://github.com/AmenRa/retriv) is open-source software licensed under the [MIT license](LICENSE).



            
