bbm25-haystack

Name	bbm25-haystack
Version	0.2.1
home_page	None
Summary	Haystack 2.x In-memory Document Store with Enhanced Efficiency
upload_time	2024-04-27 04:05:49
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	None
keywords	bm25 document search haystack llm agent rag

[![test](https://github.com/Guest400123064/bbm25-haystack/actions/workflows/test.yml/badge.svg)](https://github.com/Guest400123064/bbm25-haystack/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/Guest400123064/bbm25-haystack/graph/badge.svg?token=IGRIRBHZ3U)](https://codecov.io/gh/Guest400123064/bbm25-haystack)
[![code style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)
[![Python 3.9](https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11-blue.svg)](https://www.python.org/downloads/release/python-390/)

# Better BM25 In-Memory Document Store

An in-memory document store is a great starting point for prototyping and debugging before migrating to production-grade stores like Elasticsearch. However, [the original implementation](https://github.com/deepset-ai/haystack/blob/0dbb98c0a017b499560521aa93186d0640aab659/haystack/document_stores/in_memory/document_store.py#L148) of BM25 retrieval rebuilds the inverted index for the entire document store __on every new search__. Furthermore, its tokenization is primitive: it only permits splitters based on regular expressions, which makes localization and domain adaptation challenging. This implementation is therefore a slight upgrade to the default BM25 in-memory document store: it updates the index incrementally instead of rebuilding it, and it tokenizes text with [SentencePiece](https://github.com/google/sentencepiece) statistical sub-word tokenization.
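
To illustrate what an incremental index update means, here is a minimal sketch in plain Python. It is not the package's actual implementation: the class `IncrementalIndex` and all of its methods are hypothetical, and tokens are plain strings rather than SentencePiece pieces. The point is only that the statistics BM25 needs (term frequencies, document frequencies, average document length) can be adjusted per document on write and delete, so nothing has to be rebuilt at query time.

```python
from collections import Counter, defaultdict


class IncrementalIndex:
    """Hypothetical toy inverted index, updated per document rather than per query."""

    def __init__(self):
        self.postings = defaultdict(dict)  # token -> {doc_id: term frequency}
        self.doc_terms = {}                # doc_id -> Counter of its tokens
        self.total_len = 0                 # running sum of all document lengths

    def add(self, doc_id, tokens):
        if doc_id in self.doc_terms:
            self.remove(doc_id)  # overwrite: drop the stale statistics first
        counts = Counter(tokens)
        self.doc_terms[doc_id] = counts
        for token, freq in counts.items():
            self.postings[token][doc_id] = freq
        self.total_len += sum(counts.values())

    def remove(self, doc_id):
        counts = self.doc_terms.pop(doc_id)
        for token in counts:
            del self.postings[token][doc_id]
            if not self.postings[token]:
                del self.postings[token]  # no document contains this token anymore
        self.total_len -= sum(counts.values())

    @property
    def avg_doc_len(self):
        return self.total_len / len(self.doc_terms) if self.doc_terms else 0.0

    def doc_freq(self, token):
        # Number of documents that contain the token.
        return len(self.postings.get(token, {}))


index = IncrementalIndex()
index.add("d1", ["bm25", "ranks", "documents"])
index.add("d2", ["bm25", "is", "a", "ranking", "function"])
index.remove("d1")
print(index.doc_freq("bm25"), index.avg_doc_len)  # prints: 1 5.0
```

Each write or delete only touches the tokens of the affected document, which is what makes searching after many small writes cheap compared with rebuilding the whole index.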

## Installation

```bash
$ pip install bbm25-haystack
```

Alternatively, you can clone the repository and install it in editable mode, so that changes to the source code take effect without reinstalling:

```bash
$ git clone https://github.com/Guest400123064/bbm25-haystack.git
$ cd bbm25-haystack
$ pip install -e .
```

## Usage

### Quick Start

Below is an example of how you can build a minimal search engine with the `bbm25_haystack` components on their own. They are also compatible with [Haystack pipelines](https://docs.haystack.deepset.ai/docs/creating-pipelines).

```python
from haystack import Document
from bbm25_haystack import BetterBM25DocumentStore, BetterBM25Retriever


document_store = BetterBM25DocumentStore()
document_store.write_documents([
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bio-luminescent waves."),
])

retriever = BetterBM25Retriever(document_store)
retriever.run(query="How many languages are spoken around the world today?")
```

### API References

You can find the full API references [here](https://guest400123064.github.io/bbm25-haystack/). In a hurry? Below are some of the most important document store parameters you might want to explore:

- `k, b, delta` - the [three BM25+ hyperparameters](https://en.wikipedia.org/wiki/Okapi_BM25).
- `sp_file` - a path to a trained SentencePiece tokenizer `.model` file. The default tokenizer is directly copied from [LLaMA-2-7B-32K tokenizer](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/tokenizer.model) with a vocab size of 32,000.
- `n_grams` - defaults to 1, which means that text (both queries and documents) is tokenized into uni-grams only. If set to 2, the tokenizer also augments the list of uni-grams with bi-grams, and so on. If specified as a tuple, e.g., `(2, 3)`, the tokenizer only produces bi-grams and tri-grams, without any uni-grams.
- `haystack_filter_logic` - see [below](#filtering-logic).
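
The `n_grams` behavior described above can be sketched as follows. This is an illustration, not the package's code: the function `expand_n_grams` is hypothetical, it works on plain string tokens rather than SentencePiece pieces, and it assumes that a tuple is read as an inclusive `(min_n, max_n)` range, which agrees with the `(2, 3)` example.

```python
def expand_n_grams(tokens, n_grams=1):
    """Return n-grams as tuples (hypothetical helper).

    An int ``n`` yields all sizes from 1 to n; a tuple is assumed to mean
    an inclusive (min_n, max_n) range.
    """
    min_n, max_n = (1, n_grams) if isinstance(n_grams, int) else n_grams
    return [
        tuple(tokens[i : i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = ["better", "bm25", "store"]
print(expand_n_grams(tokens, 2))
# [('better',), ('bm25',), ('store',), ('better', 'bm25'), ('bm25', 'store')]
print(expand_n_grams(tokens, (2, 3)))
# [('better', 'bm25'), ('bm25', 'store'), ('better', 'bm25', 'store')]
```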

The retriever parameters are largely the same as [`InMemoryBM25Retriever`](https://docs.haystack.deepset.ai/docs/inmemorybm25retriever).
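
To make the roles of `k`, `b`, and `delta` concrete, below is a sketch of a per-term BM25+ score. The function name, its signature, the default values, and the IDF variant `log((N + 1) / df)` are assumptions made for illustration; the store may differ in these details, so check the API references for the actual behavior.

```python
import math


def bm25_plus_term_score(
    tf, doc_len, avg_doc_len, n_docs, doc_freq, k=1.5, b=0.75, delta=1.0
):
    """Score contribution of one query term for one document (illustrative).

    k     - saturation of the term frequency
    b     - strength of the document length normalization (0 disables it)
    delta - lower bound added for every matched term, the "+" in BM25+
    """
    idf = math.log((n_docs + 1) / doc_freq)         # assumed IDF variant
    norm = 1.0 - b + b * doc_len / avg_doc_len      # length normalization
    return idf * (tf * (k + 1.0) / (tf + k * norm) + delta)


# Same term frequency, but the second document is four times longer.
short_doc = bm25_plus_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_plus_term_score(tf=2, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # prints: True
```

With `delta > 0`, a very long document that contains the term still scores strictly higher than one that does not contain it at all, which is the motivation for BM25+ over plain BM25.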

## Filtering Logic

The current document store uses [`document_matches_filter`](https://github.com/deepset-ai/haystack/blob/main/haystack/utils/filters.py) shipped with Haystack to perform filtering by default, which is the same as [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore).

However, there is also an alternative filtering logic shipped with this implementation (unstable at this point). To use this alternative logic, initialize the document store with `haystack_filter_logic=False`. Please find comments and implementation details in [`filters.py`](./src/bbm25_haystack/filters.py). TL;DR:

- Any comparison involving `None`, i.e., a missing value, always returns `False`, regardless of whether the document attribute value or the filter value is missing.
- Comparison with `pandas.DataFrame` is always prohibited to reduce surprises.
- No implicit `datetime` conversion from string values.
- `in` and `not in` allow any `Iterable` as the filter value, without the `list` constraint.

In this case, the negation logic needs to be reconsidered because `False` can now result from both the input nullity check and the actual comparison. For instance, `in` and `not in` both yield a non-match when a value is missing. But I think keeping input processing and comparisons separate makes the filtering behavior more transparent.
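
The interaction between the nullity check and negation can be sketched like this. The helpers `op_in` and `op_not_in` are hypothetical stand-ins written for this illustration, not the functions in `filters.py`; they only mirror the rules listed above.

```python
from collections.abc import Iterable


def _is_missing(doc_value, filter_value):
    # Input check, kept separate from the comparison itself.
    return doc_value is None or filter_value is None


def _check_iterable(filter_value):
    if not isinstance(filter_value, Iterable):
        raise TypeError("filter value for 'in' / 'not in' must be iterable")


def op_in(doc_value, filter_value):
    if _is_missing(doc_value, filter_value):
        return False
    _check_iterable(filter_value)
    return doc_value in filter_value


def op_not_in(doc_value, filter_value):
    if _is_missing(doc_value, filter_value):
        return False  # a missing value never matches, even for a negated operator
    _check_iterable(filter_value)
    return doc_value not in filter_value


langs = {"en", "fr"}                               # any iterable, not only a list
print(op_in("en", langs), op_not_in("de", langs))  # prints: True True
print(op_in(None, langs), op_not_in(None, langs))  # prints: False False
print(not op_in(None, langs))                      # prints: True
```

The last two lines show why negation needs care: under these rules `not in` is no longer the same as negating `in` when a value is missing.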

## Search Quality Evaluation

This repo has [a simple script](./scripts/benchmark_beir.py) to help evaluate search quality on the [BEIR](https://github.com/beir-cellar/beir/tree/main) benchmark. To run it, you need to clone the repository (or manually download the script and place it under a folder named `scripts`) and install an additional dependency:

```bash
$ pip install beir
```

To run the script, you may want to specify the dataset name and BM25 hyperparameters. For example:

```bash
$ python scripts/benchmark_beir.py --datasets scifact arguana --bm25-k1 1.2 --n-grams 2 --output eval.csv
```

It automatically downloads the benchmarking dataset to `benchmarks/beir`, where `benchmarks` is at the same level as `scripts`. You may also check the help page for more information.

```bash
$ python scripts/benchmark_beir.py --help
```

New benchmarking scripts are expected to be added in the future.

## License

`bbm25-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "bbm25-haystack",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "BM25, Document Search, Haystack, LLM Agent, RAG",
    "author": null,
    "author_email": "Yuxuan Wang <wangy49@seas.upenn.edu>",
    "download_url": "https://files.pythonhosted.org/packages/72/79/2ece41428b9043be036471d72a4cf59467eb4c82f98c045e3c23ef64b9d9/bbm25_haystack-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Haystack 2.x In-memory Document Store with Enhanced Efficiency",
    "version": "0.2.1",
    "project_urls": {
        "Documentation": "https://github.com/Guest400123064/bbm25-haystack#readme",
        "Issues": "https://github.com/Guest400123064/bbm25-haystack/issues",
        "Source": "https://github.com/Guest400123064/bbm25-haystack"
    },
    "split_keywords": [
        "bm25",
        " document search",
        " haystack",
        " llm agent",
        " rag"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dc34007183c77d4570f9ef1c0660238831acda6adcab1b084bbca3d997c81ef0",
                "md5": "5efa08a1c02b658f8430f03b7fe2e703",
                "sha256": "6339a538a937c8859058b29a0cb738f9e4130c49c5dc8e15e3259717a4fbc7e3"
            },
            "downloads": -1,
            "filename": "bbm25_haystack-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5efa08a1c02b658f8430f03b7fe2e703",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 241610,
            "upload_time": "2024-04-27T04:05:47",
            "upload_time_iso_8601": "2024-04-27T04:05:47.050312Z",
            "url": "https://files.pythonhosted.org/packages/dc/34/007183c77d4570f9ef1c0660238831acda6adcab1b084bbca3d997c81ef0/bbm25_haystack-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "72792ece41428b9043be036471d72a4cf59467eb4c82f98c045e3c23ef64b9d9",
                "md5": "f586f1a9e286139147051cb69e7a71ba",
                "sha256": "b8384deeb061976310792580070b913cf78d9260fba51056c5343bac5520c483"
            },
            "downloads": -1,
            "filename": "bbm25_haystack-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f586f1a9e286139147051cb69e7a71ba",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 377495,
            "upload_time": "2024-04-27T04:05:49",
            "upload_time_iso_8601": "2024-04-27T04:05:49.230670Z",
            "url": "https://files.pythonhosted.org/packages/72/79/2ece41428b9043be036471d72a4cf59467eb4c82f98c045e3c23ef64b9d9/bbm25_haystack-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-27 04:05:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Guest400123064",
    "github_project": "bbm25-haystack#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "bbm25-haystack"
}
