| Field | Value |
| --- | --- |
| Name | referral-augment |
| Version | 0.1.1 |
| Summary | Official implementation of "Referral Augmentation for Zero-Shot Information Retrieval" |
| upload_time | 2023-09-16 21:25:13 |
| docs_url | None |
| requires_python | >=3.7 |
| license | MIT License |
| keywords | retrieval |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Referral-augmented retrieval (RAR)
## Installation
Install with pip:
```
pip install referral-augment
```
Alternatively, install from source:
```
git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
pip install -e .
```
## Overview
Simple, general implementations of referral-augmented retrieval are provided in `rar.retrievers`. We support three aggregation methods — concatenation, mean, and shortest path — as described in the paper, which can be specified via an `AggregationType` constructor argument.
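As a rough illustration of what mean aggregation computes (a sketch of the idea, not the library's internal code), a document's index embedding under `AggregationType.MEAN` is the average of the document's own embedding and the embeddings of its referrals:
```python
import numpy as np

def mean_aggregate(doc_embedding, referral_embeddings):
    # Average the document's embedding with the embeddings of its referrals;
    # the result stands in for the raw document vector in the index.
    return np.mean([doc_embedding, *referral_embeddings], axis=0)

doc_vec = np.array([1.0, 0.0])
referral_vecs = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(mean_aggregate(doc_vec, referral_vecs))  # [0.667 0.667]
```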
Under our framework, retrieval with BM25 is as simple as:
```python
from rar.retrievers import BM25Retriever
retriever = BM25Retriever(docs, referrals)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```
Similarly, retrieval with any dense embedding model on HuggingFace:
```python
from rar.retrievers import DenseRetriever, AggregationType
from rar.encoders import HuggingFaceEncoder
encoder = HuggingFaceEncoder('facebook/contriever')
retriever = DenseRetriever(encoder, docs, referrals, aggregation=AggregationType.MEAN)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```
For convenience, we also include direct implementations of the `SimCSEEncoder` and `SpecterEncoder`.
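Assuming these bundled encoders follow the same interface as `HuggingFaceEncoder` above (the no-argument constructor below is an assumption, not documented behavior), usage would look like:
```python
from rar.retrievers import DenseRetriever
from rar.encoders import SpecterEncoder

encoder = SpecterEncoder()  # assumed no-argument constructor
retriever = DenseRetriever(encoder, docs, referrals)
retriever.retrieve('paper that introduced the Flickr30k dataset', num_docs=10)
```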
Example replications of the paper's results, which show the advantage of referral augmentation and demonstrate a full, concise retrieval and evaluation pipeline, can be found in `examples.ipynb`.
### Optional: install with SimCSE support
Note that the only stable way to use SimCSE is currently to include its [source](https://github.com/princeton-nlp/SimCSE) as a module, which requires installing `rar` from source. To install with support for `SimCSEEncoder`:
```
git clone https://github.com/michaelwilliamtang/referral-augment
cd referral-augment
pip install -r requirements.txt
cd src/rar/encoders
git clone https://github.com/princeton-nlp/SimCSE
cd SimCSE
pip install -r requirements.txt
cd ../../../..
```
## Data
We provide sample data in zipped form [here](https://drive.google.com/file/d/1IVo3sJ-H5i17KdQq4-kBr9oL64KLxtEc/view?usp=sharing) — to use, unzip and place `data/` under the repository's root directory.
Our sample data covers two domains, each with a *corpus* of documents and referrals and an evaluation *dataset* of queries and ground truth documents. Under the `paper_retrieval` domain, we include the `acl`, `acl_small`, and `arxiv` corpuses and datasets, and under the `entity_retrieval` domain, we include the `dbpedia_small` corpus and dataset.
Construction details:
- The `acl_small`, `acl`, and `arxiv` corpuses are constructed from the rich paper metadata parses provided by Allen AI's [S2ORC](https://github.com/allenai/s2orc) project. Documents consist of concatenated paper titles and abstracts from up-to-2017 ACL and arXiv papers, respectively, and referrals consist of in-text citations between up-to-2017 papers. The respective evaluation datasets are drawn from the same parses and consist of in-text citations from *2018-and-on* papers citing the up-to-2017 papers in the corpus; this time-based split prevents data leakage and mirrors deployment conditions.
- The `dbpedia_small` corpus and dataset are sampled from the DBPedia task in the [BEIR](https://github.com/beir-cellar/beir) benchmark. Referrals are mined from Wikipedia HTML using [WikiExtractor](https://github.com/attardi/wikiextractor).
Data can be loaded via our utility functions at `rar.utils`:
```python
from rar.utils import load_corpus, load_eval_dataset
docs, referrals = load_corpus(domain='paper_retrieval', corpus='acl_small')
queries, ground_truth = load_eval_dataset(domain='paper_retrieval', dataset='acl_small')
```
Our data representations are simple and intuitive:
- A `corpus` is a list of document strings
- A set of `referrals` is a list of lists of document strings (one list of referrals per document)

Similarly:
- A set of `queries` is a list of query strings
- The corresponding `ground_truth` is *either* a list of document strings (one ground truth document per query, e.g. the cited paper in paper retrieval) *or* a list of lists of document strings (multiple relevant ground truth documents per query, e.g. all relevant Wikipedia pages for a given `dbpedia_small` query)
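For instance, a toy multiple-correct example (illustrative data, not drawn from `dbpedia_small`):
```python
docs = ['Bill Gates co-founded Microsoft',
        'Microsoft is headquartered in Redmond']
queries = ['Tell me about Microsoft']
ground_truth = [[docs[0], docs[1]]]  # two relevant documents for one query
```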
### Custom data
Creating a corpus is as simple as constructing these lists (referrals are optional). For example:
```python
docs = ['Steve Jobs was a revolutionary technological thinker and designer', "Bill Gates founded the world's largest software company"]
referrals = [['Apple CEO', 'Magic Leap founder'], ['Microsoft CEO', 'The Giving Pledge co-founder']]
retriever = BM25Retriever(docs, referrals)
```
Creating an evaluation dataset is similarly easy:
```python
queries = ['Who built the Apple Macintosh?']
ground_truth = [docs[0]]
```
## Evaluation
We implement the Recall@k and MRR metrics under `rar.metrics`, which can be used standalone or with our utility functions at `rar.utils`:
```python
from rar.utils import evaluate_retriever
evaluate_retriever(retriever, queries, ground_truth)
```
By default, `evaluate_retriever` attempts to compute MRR, Recall@1, and Recall@10. The keyword parameter `multiple_correct=False` removes MRR, since MRR does not support multiple ground truth documents per query (e.g. for `dbpedia_small`). See `examples.ipynb` for example outputs.
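The standalone interface of `rar.metrics` is not documented here, but the metrics themselves are standard; a minimal self-contained sketch for the single-correct case (hypothetical helper functions, not the library's API):
```python
def recall_at_k(retrieved, ground_truth, k):
    # Fraction of queries whose correct document appears in the top-k results.
    hits = sum(gt in results[:k] for results, gt in zip(retrieved, ground_truth))
    return hits / len(ground_truth)

def mrr(retrieved, ground_truth):
    # Mean reciprocal rank of the correct document; contributes 0 if absent.
    total = 0.0
    for results, gt in zip(retrieved, ground_truth):
        if gt in results:
            total += 1.0 / (results.index(gt) + 1)
    return total / len(ground_truth)
```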
If you find this repository helpful, feel free to cite our publication [Referral Augmentation for Zero-Shot Information Retrieval](https://arxiv.org/abs/2305.15098):
```
@misc{tang2023referral,
title={Referral Augmentation for Zero-Shot Information Retrieval},
author={Michael Tang and Shunyu Yao and John Yang and Karthik Narasimhan},
year={2023},
eprint={2305.15098},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```