# Referral-augmented retrieval (RAR)

## Installation

Install with pip:
```
pip install referral-augment
```
Alternatively, install from source:
```
git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
pip install -e .
```

## Overview

Simple, general implementations of referral-augmented retrieval are provided in `rar.retrievers`. We support three aggregation methods — concatenation, mean, and shortest path — as described in the paper, which can be specified via an `AggregationType` constructor argument.

Under our framework, retrieval with BM25 is as simple as:
```python
from rar.retrievers import BM25Retriever

# docs is a list of document strings; referrals is a list of referral lists,
# one per document (see the Data section below)
retriever = BM25Retriever(docs, referrals)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```
Similarly, retrieval with any dense embedding model on HuggingFace:
```python
from rar.retrievers import DenseRetriever, AggregationType
from rar.encoders import HuggingFaceEncoder

# Any dense embedding model name on HuggingFace works here
encoder = HuggingFaceEncoder('facebook/contriever')
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.MEAN)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```
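
Switching aggregation methods is just a matter of passing a different `AggregationType`. A minimal sketch comparing all three (only `MEAN` is confirmed above; the `CONCAT` and `SHORTEST_PATH` member names are assumptions):
```python
# Compare the three aggregation methods from the paper.
# NOTE: CONCAT and SHORTEST_PATH are assumed member names; only MEAN is
# confirmed by the examples above.
for aggregation in (AggregationType.CONCAT,
                    AggregationType.MEAN,
                    AggregationType.SHORTEST_PATH):
    retriever = DenseRetriever(encoder, docs, referrals, aggregation=aggregation)
    print(aggregation, retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=1))
```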
For convenience, we also include direct implementations of `SimCSEEncoder` and `SpecterEncoder`.
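
For instance, a minimal sketch using `SpecterEncoder` (the no-argument constructor is an assumption):
```python
from rar.retrievers import DenseRetriever, AggregationType
from rar.encoders import SpecterEncoder

# A no-argument constructor is assumed here for illustration
encoder = SpecterEncoder()
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.MEAN)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```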

Example replications of paper results, demonstrating both the advantage of referral augmentation and a full, concise retrieval and evaluation pipeline, can be found in `examples.ipynb`.

### Optional: install with SimCSE support

Note that the only stable way to use SimCSE is currently to include its [source](https://github.com/princeton-nlp/SimCSE) as a module, which requires building `rar` from source. To install with support for `SimCSEEncoder`:
```
git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
cd src/rar/encoders
git clone https://github.com/princeton-nlp/SimCSE
cd SimCSE
pip install -r requirements.txt
cd ../../../..
```

## Data

We provide sample data in zipped form [here](https://drive.google.com/file/d/1IVo3sJ-H5i17KdQq4-kBr9oL64KLxtEc/view?usp=sharing) — to use, unzip and place `data/` under the repository's root directory.

Our sample data covers two domains, each with a *corpus* of documents and referrals and an evaluation *dataset* of queries and ground truth documents. Under the `paper_retrieval` domain, we include the `acl`, `acl_small`, and `arxiv` corpuses and datasets, and under the `entity_retrieval` domain, we include the `dbpedia_small` corpus and dataset.

Construction details:
- The `acl_small`, `acl`, and `arxiv` corpuses are constructed from the rich paper metadata parses provided by Allen AI's [S2ORC](https://github.com/allenai/s2orc) project. Documents consist of concatenated paper titles and abstracts from up-to-2017 ACL and arXiv papers, respectively, and referrals consist of in-text citations between up-to-2017 papers. The respective evaluation datasets come from the same parses and consist of in-text citations from *2018-and-on* papers citing the up-to-2017 papers in the corpus; this time-based split prevents data leakage and mirrors deployment conditions.
- The `dbpedia_small` corpus and dataset are sampled from the DBPedia task in the [BEIR](https://github.com/beir-cellar/beir) benchmark. Referrals are mined from Wikipedia HTML using [WikiExtractor](https://github.com/attardi/wikiextractor).

Data can be loaded via our utility functions at `rar.utils`:
```python
from rar.utils import load_corpus, load_eval_dataset
docs, referrals = load_corpus(domain='paper_retrieval', corpus='acl_small')
queries, ground_truth = load_eval_dataset(domain='paper_retrieval', dataset='acl_small')
```
Our data representations are simple and intuitive:
- A `corpus` is a list of document strings
- A set of `referrals` is a list of lists of document strings (one list of referrals per document)

Similarly:
- A set of `queries` is a list of query strings
- The corresponding `ground_truth` is *either* a list of document strings (one ground truth document per query, e.g. the cited paper in paper retrieval) *or* a list of lists of document strings (multiple relevant ground truth documents per query, e.g. all relevant Wikipedia pages for a given `dbpedia_small` query)

### Custom data

Creating a corpus is as simple as constructing these lists (referrals are optional). For example:
```python
from rar.retrievers import BM25Retriever

docs = ['Steve Jobs was a revolutionary technological thinker and designer', "Bill Gates founded the world's largest software company"]
referrals = [['Apple CEO', 'Magic Leap founder'], ['Microsoft CEO', 'The Giving Pledge co-founder']]

retriever = BM25Retriever(docs, referrals)
```
Creating an evaluation dataset is similarly easy:
```python
queries = ['Who built the Apple Macintosh?']
ground_truth = [docs[0]]
```

## Evaluation

We implement the Recall@k and MRR metrics under `rar.metrics`, which can be used standalone or with our utility functions at `rar.utils`:
```python
from rar.utils import evaluate_retriever
evaluate_retriever(retriever, queries, ground_truth)
```
By default, `evaluate_retriever` attempts to compute the MRR, Recall@1, and Recall@10 metrics. Passing the keyword parameter `multiple_correct=True` removes MRR, since MRR does not support multiple ground truth documents per query (e.g. for `dbpedia_small`). See `examples.ipynb` for example outputs.
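
Putting the pieces together, an end-to-end evaluation on `dbpedia_small` might look like the following sketch (it assumes `multiple_correct=True` is the right setting for corpora with multiple ground truth documents per query):
```python
from rar.retrievers import BM25Retriever
from rar.utils import load_corpus, load_eval_dataset, evaluate_retriever

# dbpedia_small has multiple relevant documents per query, so MRR is dropped;
# multiple_correct=True is an assumption based on the description above
docs, referrals = load_corpus(domain='entity_retrieval', corpus='dbpedia_small')
queries, ground_truth = load_eval_dataset(domain='entity_retrieval', dataset='dbpedia_small')

retriever = BM25Retriever(docs, referrals)
evaluate_retriever(retriever, queries, ground_truth, multiple_correct=True)
```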

If you find this repository helpful, feel free to cite our publication [Referral Augmentation for Zero-Shot Information Retrieval](https://arxiv.org/abs/2305.15098):

```
@misc{tang2023referral,
      title={Referral Augmentation for Zero-Shot Information Retrieval}, 
      author={Michael Tang and Shunyu Yao and John Yang and Karthik Narasimhan},
      year={2023},
      eprint={2305.15098},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

            
