[![test](https://github.com/Guest400123064/bbm25-haystack/actions/workflows/test.yml/badge.svg)](https://github.com/Guest400123064/bbm25-haystack/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/Guest400123064/bbm25-haystack/graph/badge.svg?token=IGRIRBHZ3U)](https://codecov.io/gh/Guest400123064/bbm25-haystack)
[![code style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)
[![Python 3.9](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11-blue.svg)](https://www.python.org/downloads/release/python-390/)
# Better BM25 In-Memory Document Store
An in-memory document store is a great starting point for prototyping and debugging before migrating to production-grade stores like Elasticsearch. However, [the original implementation](https://github.com/deepset-ai/haystack/blob/0dbb98c0a017b499560521aa93186d0640aab659/haystack/document_stores/in_memory/document_store.py#L148) of BM25 retrieval rebuilds the inverted index for the entire document store __on every new search__. Furthermore, the tokenization is primitive, permitting only regular-expression-based splitters, which makes localization and domain adaptation challenging. This implementation therefore upgrades the default BM25 in-memory document store with incremental index updates and [SentencePiece](https://github.com/google/sentencepiece) statistical sub-word tokenization.
## Installation
```bash
$ pip install bbm25-haystack
```
Alternatively, you can clone the repository and install from source in editable mode, so that changes to the source code are reflected immediately:
```bash
$ git clone https://github.com/Guest400123064/bbm25-haystack.git
$ cd bbm25-haystack
$ pip install -e .
```
## Usage
### Quick Start
Below is an example of how you can build a minimal search engine with the `bbm25_haystack` components on their own. They are also compatible with [Haystack pipelines](https://docs.haystack.deepset.ai/docs/creating-pipelines).
```python
from haystack import Document
from bbm25_haystack import BetterBM25DocumentStore, BetterBM25Retriever
document_store = BetterBM25DocumentStore()
document_store.write_documents([
Document(content="There are over 7,000 languages spoken around the world today."),
Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bio-luminescent waves.")
])
retriever = BetterBM25Retriever(document_store)
retriever.run(query="How many languages are spoken around the world today?")
```
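Under the hood, each document is ranked against the query with a BM25+ scoring function. Below is a minimal, self-contained sketch of that scoring, using the conventional `k1`/`b`/`delta` parameterization (which presumably maps onto the store's `k, b, delta` parameters); the function name is hypothetical and this is not the library's actual implementation:

```python
import math
from collections import Counter


def bm25_plus(query_tokens, docs_tokens, k1=1.5, b=0.75, delta=1.0):
    """Score each tokenized document against the query with BM25+."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n

    # Document frequency of each term across the corpus.
    df = Counter(t for d in docs_tokens for t in set(d))

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            # The lower bound `delta` is what distinguishes BM25+ from plain BM25.
            score += idf * (norm + delta)
        scores.append(score)
    return scores
```

The `delta` term guarantees that a document containing a query term at all always outscores one of arbitrary length that does not, which plain BM25 cannot guarantee.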
### API References
You can find the full API reference [here](https://guest400123064.github.io/bbm25-haystack/). In a hurry? Below are the most important document store parameters you might want to explore:
- `k, b, delta` - the [three BM25+ hyperparameters](https://en.wikipedia.org/wiki/Okapi_BM25).
- `sp_file` - a path to a trained SentencePiece tokenizer `.model` file. The default tokenizer is directly copied from [LLaMA-2-7B-32K tokenizer](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/tokenizer.model) with a vocab size of 32,000.
- `n_grams` - defaults to 1, meaning text (both query and document) is tokenized into uni-grams. If set to 2, the tokenizer augments the list of uni-grams with bi-grams, and so on. If specified as a tuple, e.g., `(2, 3)`, the tokenizer produces only bi-grams and tri-grams, without any uni-grams.
- `haystack_filter_logic` - see [below](#filtering-logic).
The retriever parameters are largely the same as [`InMemoryBM25Retriever`](https://docs.haystack.deepset.ai/docs/inmemorybm25retriever).
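The `n_grams` semantics above can be sketched with a small helper (a hypothetical illustration, not the library's tokenizer, which operates on SentencePiece sub-word tokens):

```python
def make_ngrams(tokens, n_grams=1):
    """Expand a list of uni-gram tokens according to an ``n_grams`` spec.

    An int N yields all k-grams for k = 1..N (uni-grams augmented with
    higher-order grams); a tuple (lo, hi) yields only k-grams for k = lo..hi.
    """
    lo, hi = (1, n_grams) if isinstance(n_grams, int) else n_grams
    return [
        " ".join(tokens[i:i + k])
        for k in range(lo, hi + 1)
        for i in range(len(tokens) - k + 1)
    ]
```

For example, `make_ngrams(["a", "b", "c"], 2)` keeps the uni-grams and appends `"a b"` and `"b c"`, while `make_ngrams(["a", "b", "c"], (2, 3))` drops the uni-grams entirely.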
## Filtering Logic
The current document store uses [`document_matches_filter`](https://github.com/deepset-ai/haystack/blob/main/haystack/utils/filters.py) shipped with Haystack to perform filtering by default, which is the same as [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore).
However, this implementation also ships an alternative filtering logic (currently unstable). To use it, initialize the document store with `haystack_filter_logic=False`. Comments and implementation details can be found in [`filters.py`](./src/bbm25_haystack/filters.py). TL;DR:
- Any comparison involving `None`, i.e., a missing value, returns `False`, whether the missing value comes from the document attribute or from the filter.
- Comparison with `pandas.DataFrame` is always prohibited to reduce surprises.
- No implicit `datetime` conversion from string values.
- `in` and `not in` accept any `Iterable` as the filter value, lifting the `list`-only constraint.
With this logic, negation needs to be reconsidered because `False` can now arise from both the input nullity check and the actual comparison. For instance, `in` and `not in` both yield non-matching upon missing values. Separating input processing from comparison, however, makes the filtering behavior more transparent.
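The `None`-handling described above can be sketched as follows (a hypothetical helper for illustration only; see `filters.py` for the actual logic and the full operator set):

```python
def compare(op, doc_value, filter_value):
    """Evaluate one comparison under the alternative filter semantics."""
    # Any comparison involving a missing value is non-matching, so `in`
    # and `not in` BOTH return False when either side is None; negation
    # applies only after the nullity check passes.
    if doc_value is None or filter_value is None:
        return False
    if op == "in":
        return doc_value in filter_value  # any Iterable works, not just list
    if op == "not in":
        return doc_value not in filter_value
    raise ValueError(f"unsupported operator: {op}")
```

Note that `compare("in", None, ["a"])` and `compare("not in", None, ["a"])` both return `False`, which is the surprising-but-deliberate consequence of checking nullity before comparing.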
## Search Quality Evaluation
This repo includes [a simple script](./scripts/benchmark_beir.py) to help evaluate search quality on the [BEIR](https://github.com/beir-cellar/beir/tree/main) benchmark. Clone the repository (or manually download the script and place it in a folder named `scripts`) and install the additional dependency:
```bash
$ pip install beir
```
To run the script, specify the dataset names and BM25 hyperparameters. For example:
```bash
$ python scripts/benchmark_beir.py --datasets scifact arguana --bm25-k1 1.2 --n-grams 2 --output eval.csv
```
The script automatically downloads the benchmark datasets into `benchmarks/beir`, where `benchmarks` sits at the same level as `scripts`. You may also check the help page for more information.
```bash
$ python scripts/benchmark_beir.py --help
```
New benchmarking scripts are expected to be added in the future.
## License
`bbm25-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.