sprint-toolkit

Name: sprint-toolkit
Version: 0.0.3
Home page: https://github.com/thakur-nandan/sprint
Summary: SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval
Upload time: 2023-06-21 18:40:24
Author: Nandan Thakur
Requires Python: >=3.7
Keywords: information retrieval, toolkit, sparse retrievers, networks, BERT, PyTorch, IR, NLP, deep learning
            <h1 align="center">
<img style="vertical-align:middle" width="620" height="120" src="./images/sprint-logo.png" />
</h1>

<p align="center">
    <a href="https://github.com/thakur-nandan/sprint/releases">
        <img alt="GitHub release" src="https://img.shields.io/badge/release-v0.0.1-blue">
    </a>
    <a href="https://www.python.org/">
            <img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
    </a>
    <a href="https://github.com/thakur-nandan/sprint/blob/master/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/thakur-nandan/sprint.svg?color=green">
    </a>
    <!-- <a href="https://colab.research.google.com/drive/1HfutiEhHMJLXiWGT8pcipxT5L2TpYEdt?usp=sharing">
        <img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
    </a> -->
    <a href="https://pepy.tech/project/sprint-toolkit">
        <img alt="Downloads" src="https://static.pepy.tech/personalized-badge/sprint?period=total&units=international_system&left_color=grey&right_color=orange&left_text=Downloads">
    </a>
    <a href="https://github.com/thakur-nandan/sprint/">
        <img alt="Downloads" src="https://badges.frapsoft.com/os/v1/open-source.svg?v=103">
    </a>
</p>

<h3 align="center">
    <a href="https://uwaterloo.ca"><img style="float: left; padding: 2px 7px 2px 7px;" width="213" height="67" src="./images/uwaterloo.png" /></a>
    <a href="http://www.ukp.tu-darmstadt.de"><img style="float: middle; padding: 2px 7px 2px 7px;" width="147" height="67" src="./images/ukp.png" /></a>
    <a href="https://www.tu-darmstadt.de/"><img style="float: right; padding: 2px 7px 2px 7px;" width="167.7" height="60" src="./images/tu-darmstadt.png" /></a>
</h3>

### SPRINT provides a _unified_ repository to easily _evaluate_ diverse state-of-the-art neural (BERT-based) sparse-retrieval models.

The SPRINT toolkit allows you to easily search with or evaluate any neural sparse retriever on **any** dataset in the BEIR benchmark (or your own dataset). It is built as a wrapper around Pyserini and provides evaluation of seven diverse (neural) sparse retrieval models: [SPLADEv2](https://arxiv.org/abs/2109.10086), [BT-SPLADE-L](https://arxiv.org/abs/2207.03834), [uniCOIL](https://arxiv.org/abs/2106.14807), [TILDEv2](https://arxiv.org/abs/2108.08513), [DeepImpact](https://arxiv.org/abs/2104.12016), [DocT5query](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf) and [SPARTA](https://aclanthology.org/2021.naacl-main.47/).

If you would like to read more about the SPRINT toolkit, or wish to know which model to use, please refer to our paper:

* [**SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval**]() (Accepted at SIGIR'23 Resource Track)

## :runner: Getting Started
SPRINT is backed by Pyserini, which relies on Java. To make the installation easier, we recommend following the steps below via `conda`:

```bash
#### Create and activate a new conda environment ####
$ conda create -n sprint_env python=3.8
$ conda activate sprint_env

# Install JDK 11 via conda
$ conda install -c conda-forge openjdk=11

# Install SPRINT toolkit using PyPI
$ pip install sprint-toolkit
```
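
To verify the setup, check that both the JDK and the package are visible (a quick sanity check; `sprint` is the top-level module used in the snippets below):

```bash
$ java -version   # should report OpenJDK 11
$ python -c "import sprint; print('SPRINT is ready')"
```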

## :runner: Quickstart with SPRINT Toolkit

### All in one
For a quick start, go to the [example](examples/inference/distilsplade_max/beir_scifact/all_in_one.sh) evaluating SPLADE (`distilsplade_max`) on the BeIR/SciFact dataset:
```bash
cd examples/inference/distilsplade_max/beir_scifact
bash all_in_one.sh
```
This runs the whole pipeline and writes the final evaluation results to `beir_scifact-distilsplade_max-quantized/evaluation/metrics.json`:

<details>
  <summary>Results: distilsplade_max on BeIR/SciFact</summary>
  
  ```bash
     cat beir_scifact-distilsplade_max-quantized/evaluation/metrics.json 
     # {
     #     "nDCG": {
     #         "NDCG@1": 0.60333,
     #         "NDCG@3": 0.65969,
     #         "NDCG@5": 0.67204,
     #         "NDCG@10": 0.6925,
     #         "NDCG@100": 0.7202,
     #         "NDCG@1000": 0.72753
     #     },
     #     "MAP": {
     #         "MAP@1": 0.57217,
     #     ...
     # }
  ```
</details>

Or, if you prefer running Python directly, use the code snippet below to evaluate `castorini/unicoil-noexp-msmarco-passage` on `BeIR/SciFact`:
```python
from sprint.inference import aio


if __name__ == '__main__':  # aio.run can only be called within __main__
    aio.run(
        encoder_name='unicoil',
        ckpt_name='castorini/unicoil-noexp-msmarco-passage',
        data_name='beir/scifact',
        gpus=[0, 1],
        output_dir='beir_scifact-unicoil_noexp',
        do_quantization=True,
        quantization_method='range-nbits',  # So the doc term weights will be quantized by `(term_weights / 5) * (2 ** 8)`
        original_score_range=5,
        quantization_nbits=8,
        original_query_format='beir',
        topic_split='test'
    )
    # You would get "NDCG@10": 0.68563
```
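
With `quantization_method='range-nbits'`, each float term weight in `[0, original_score_range]` is mapped to an integer in `[0, 2**quantization_nbits]`, as the comment above notes. A minimal sketch of that arithmetic (an illustration only, not SPRINT's internal code):

```python
def quantize_range_nbits(term_weights, original_score_range=5, nbits=8):
    # (weight / range) * 2**nbits, truncated to an integer
    return {
        term: int(weight / original_score_range * (2 ** nbits))
        for term, weight in term_weights.items()
    }

# A weight of 2.5 maps to int(2.5 / 5 * 256) = 128.
print(quantize_range_nbits({'sparse': 2.5, 'retrieval': 1.2}))
# -> {'sparse': 128, 'retrieval': 61}
```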
### Step by step
One can also run the above process in six separate steps, using the scripts under the [step_by_step](examples/inference/distilsplade_max/beir_scifact/step_by_step) folder (chained in the sketch after this list):
1. [encode](examples/inference/distilsplade_max/beir_scifact/step_by_step/1.encode.beir_scifact-distilsplade_max-float.sh): Encode documents into term weights, with multiprocessing over multiple GPUs;
2. [quantize](examples/inference/distilsplade_max/beir_scifact/step_by_step/2.quantize.beir_scifact-distilsplade_max-2digits.sh): Quantize the document term weights into integers (can be skipped);
3. [index](examples/inference/distilsplade_max/beir_scifact/step_by_step/3.index.beir_scifact-distilsplade_max-2digits.sh): Index the term weights into a Lucene index (backed by Pyserini);
4. [reformat](examples/inference/distilsplade_max/beir_scifact/step_by_step/4.reformat_query.beir_scifact.sh): Reformat the queries file (e.g. the one from BeIR) into the Pyserini format;
5. [search](examples/inference/distilsplade_max/beir_scifact/step_by_step/5.search.beir_scifact-distilsplade_max-2digits.sh): Retrieve the relevant documents (backed by Pyserini);
6. [evaluate](examples/inference/distilsplade_max/beir_scifact/step_by_step/6.evaluate.beir_scifact-distilsplade_max-2digits.sh): Evaluate the results against labeled data, e.g. the qrels used in BeIR (backed by BeIR).
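
For SciFact, chaining all six scripts looks like this (the file names are those linked above):

```bash
cd examples/inference/distilsplade_max/beir_scifact/step_by_step
bash 1.encode.beir_scifact-distilsplade_max-float.sh
bash 2.quantize.beir_scifact-distilsplade_max-2digits.sh
bash 3.index.beir_scifact-distilsplade_max-2digits.sh
bash 4.reformat_query.beir_scifact.sh
bash 5.search.beir_scifact-distilsplade_max-2digits.sh
bash 6.evaluate.beir_scifact-distilsplade_max-2digits.sh
```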

Currently, the following methods are **directly** supported (with verified reproduction):
- uniCOIL;
- SPLADE: go to [examples/inference/distilsplade_max/beir_scifact](examples/inference/distilsplade_max/beir_scifact) to quickly reproduce `distilsplade_max` on SciFact;
- SPARTA;
- TILDEv2: go to [examples/inference/tildev2-noexp/trecdl2019](examples/inference/tildev2-noexp/trecdl2019) to quickly reproduce `ielab/TILDEv2-noExp` reranking on TREC-DL 2019;
- DeepImpact.

Currently, the following data formats are supported (with automatic downloading):
- BeIR
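
To run on your own dataset instead, format it the way BeIR does: a `corpus.jsonl`, a `queries.jsonl`, and a tab-separated qrels file. A minimal sketch of that layout (toy contents, following the public BeIR convention):

```python
import json

# corpus.jsonl: one document per line with "_id", "title" and "text"
with open('corpus.jsonl', 'w') as f:
    f.write(json.dumps({'_id': 'doc1', 'title': 'Example', 'text': 'An example passage.'}) + '\n')

# queries.jsonl: one query per line with "_id" and "text"
with open('queries.jsonl', 'w') as f:
    f.write(json.dumps({'_id': 'q1', 'text': 'example query'}) + '\n')

# qrels/test.tsv: tab-separated relevance judgments with a header row
with open('test.tsv', 'w') as f:
    f.write('query-id\tcorpus-id\tscore\n')
    f.write('q1\tdoc1\t1\n')
```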

Support for other models and data formats will be added.

### Custom encoders
To add a custom encoder, one can refer to the example [examples/inference/custom_encoder/beir_scifact](examples/inference/custom_encoder/beir_scifact), where `distilsplade_max` is evaluated on `BeIR/SciFact` **with stopwords filtered out**.

In detail, one just needs to define custom encoder classes and write a new encoder builder function:
```python
from typing import Dict, List
from pyserini.encode import QueryEncoder, DocumentEncoder

class CustomQueryEncoder(QueryEncoder):

    def encode(self, text, **kwargs) -> Dict[str, float]:
        # Just an example:
        terms = text.split()
        term_weights = {term: 1 for term in terms}
        return term_weights  # Dict object, where keys/values are terms/term scores, resp.

class CustomDocumentEncoder(DocumentEncoder):

    def encode(self, texts, **kwargs) -> List[Dict[str, float]]:
        # Just an example:
        term_weights_batch = []
        for text in texts:
            terms = text.split()
            term_weights = {term: 1 for term in terms}
            term_weights_batch.append(term_weights)
        return term_weights_batch 

def custom_encoder_builder(ckpt_name, etype, device='cpu'):
    if etype == 'query':
        return CustomQueryEncoder(ckpt_name, device=device)        
    elif etype == 'document':
        return CustomDocumentEncoder(ckpt_name, device=device)
    else:
        raise ValueError(f'Unknown encoder type: {etype}')
```
Then register `custom_encoder_builder` with `sprint.inference.encoder_builders.register` before usage:
```python
from sprint.inference.encoder_builders import register

register('custom_encoder_builder', custom_encoder_builder)
```
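
After registration, the builder can be selected by its registered name. A sketch of usage (assuming the builder classes above are in scope, and that the registered name is what `aio.run` receives as `encoder_name`, as in the uniCOIL example earlier):

```python
from sprint.inference import aio
from sprint.inference.encoder_builders import register

if __name__ == '__main__':
    register('custom_encoder_builder', custom_encoder_builder)
    aio.run(
        encoder_name='custom_encoder_builder',  # the name registered above
        ckpt_name='distilsplade_max',           # forwarded to the builder as ckpt_name
        data_name='beir/scifact',
        gpus=[0],
        output_dir='beir_scifact-custom_encoder',
        do_quantization=True,
        quantization_method='range-nbits',
        original_score_range=5,
        quantization_nbits=8,
        original_query_format='beir',
        topic_split='test',
    )
```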

## Training (Experimental)
Will be added.

## Contacts
The main contributors of this repository are:

- [Nandan Thakur](https://github.com/Nthakur20)
- [Kexin Wang](https://github.com/kwang2049)

            
