<h1 align="center">
🦮 Golden Retriever
</h1>
<p align="center">
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-orange?logo=pytorch"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-blueviolet"></a>
<a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg"></a>
<a href="https://github.dev/Riccorl/golden-retriever"><img alt="vscode" src="https://img.shields.io/badge/preview%20in-vscode.dev-blue"></a>
</p>
<p align="center">
<a href="https://github.com/Riccorl/golden-retriever/releases"><img alt="release" src="https://img.shields.io/github/v/release/Riccorl/golden-retriever"></a>
<a href="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml"><img alt="gh-status" src="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml/badge.svg"></a>
</p>
# How to use
Install the library from [PyPI](https://pypi.org/project/goldenretriever-core/):
```bash
pip install goldenretriever
```
or from source:
```bash
git clone https://github.com/Riccorl/golden-retriever.git
cd golden-retriever
pip install -e .
```
# Usage
## How to run an experiment
### Training
Here is a simple example of how to train a DPR-like retriever on the WebQuestions dataset.
First, download the dataset from [DPR](https://github.com/facebookresearch/DPR), then run the following code:
```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset
# create a retriever
retriever = GoldenRetriever(
    question_encoder="intfloat/e5-small-v2",
    passage_encoder="intfloat/e5-small-v2",
)

# create a dataset
train_dataset = InBatchNegativesDataset(
    name="webq_train",
    path="path/to/webq_train.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
    shuffle=True,
)
val_dataset = InBatchNegativesDataset(
    name="webq_dev",
    path="path/to/webq_dev.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    max_steps=25_000,
    wandb_online_mode=True,
    wandb_project_name="golden-retriever-dpr",
    wandb_experiment_name="e5-small-webq",
    max_hard_negatives_to_mine=5,
)
# start training
trainer.train()
```
### Evaluation
```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset
retriever = GoldenRetriever(
    question_encoder="",
    document_index="",
    device="cuda",
    precision="16",
)

test_dataset = InBatchNegativesDataset(
    name="test",
    path="",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    test_dataset=test_dataset,
    log_to_wandb=False,
    top_k=[20, 100],
)
trainer.test()
```
## Inference
```python
from goldenretriever import GoldenRetriever
retriever = GoldenRetriever(
    question_encoder="path/to/question/encoder",
    passage_encoder="path/to/passage/encoder",
    document_index="path/to/document/index",
)
# retrieve documents
retriever.retrieve("What is the capital of France?", k=5)
```
## Data format
### Input data
The retriever expects a jsonl file similar to [DPR](https://github.com/facebookresearch/DPR):
```json lines
[
    {
        "question": "....",
        "answers": ["...", "...", "..."],
        "positive_ctxs": [{
            "title": "...",
            "text": "...."
        }],
        "negative_ctxs": ["..."],
        "hard_negative_ctxs": ["..."]
    },
    ...
]
```
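As a sketch, a record in this format can be built and written from Python. The field names follow the example above; the field values and the file name are illustrative:

```python
import json

# One training example in the DPR-style format described above.
example = {
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [{"title": "Paris", "text": "Paris is the capital of France."}],
    "negative_ctxs": ["Lyon is a city in France."],
    "hard_negative_ctxs": ["Marseille is the second-largest city in France."],
}

# Write a list of examples as a JSON array (one file per split).
with open("webq_train.json", "w", encoding="utf-8") as f:
    json.dump([example], f, ensure_ascii=False, indent=2)
```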
### Index data
The documents to index can be provided either as a jsonl file or as a tsv file, similar to
[DPR](https://github.com/facebookresearch/DPR):
- `jsonl`: each line is a json object with the following keys: `id`, `text`, `metadata`
- `tsv`: each line is a tab-separated string with the `id` and `text` column,
followed by any other column that will be stored in the `metadata` field
jsonl example:
```json lines
[
    {
        "id": "...",
        "text": "...",
        "metadata": ["{...}"]
    },
    ...
]
```
tsv example:
```tsv
id \t text \t any other column
...
```
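A tsv index in this layout can be converted to the jsonl form with a short script. This is a sketch: it assumes the tsv has a header row whose first two columns are `id` and `text`, and it folds any remaining columns into `metadata`, as described above.

```python
import csv
import json

def tsv_to_jsonl(tsv_path: str, jsonl_path: str) -> None:
    """Convert a DPR-style tsv index (id, text, extra columns) to jsonl."""
    with open(tsv_path, newline="", encoding="utf-8") as fin, \
         open(jsonl_path, "w", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        header = next(reader)  # e.g. ["id", "text", "title"]
        for row in reader:
            record = dict(zip(header, row))
            doc = {
                "id": record.pop("id"),
                "text": record.pop("text"),
                "metadata": record,  # any remaining columns
            }
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")
```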