<h1 align="center">
🦮 Golden Retriever
</h1>
<p align="center">
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-orange?logo=pytorch"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-blueviolet"></a>
<a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg"></a>
<a href="https://github.dev/Riccorl/golden-retriever"><img alt="vscode" src="https://img.shields.io/badge/preview%20in-vscode.dev-blue"></a>
</p>
<p align="center">
<a href="https://github.com/Riccorl/golden-retriever/releases"><img alt="release" src="https://img.shields.io/github/v/release/Riccorl/golden-retriever"></a>
<a href="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml"><img alt="gh-status" src="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml/badge.svg"></a>
</p>
# How to use
Install the library from [PyPI](https://pypi.org/project/goldenretriever-core/):
```bash
pip install goldenretriever-core
```
or from source:
```bash
git clone https://github.com/Riccorl/golden-retriever.git
cd golden-retriever
pip install -e .
```
### FAISS
Install the optional dependencies to use [FAISS](https://github.com/facebookresearch/faiss) indices.
The FAISS PyPI package is available only for CPU. If you want to use the GPU version, you need to install it from source or use the conda package.
For CPU:
```bash
pip install "goldenretriever-core[faiss]"
```
For GPU:
```bash
conda create -n goldenretriever python=3.11
conda activate goldenretriever
# install pytorch
conda install -y pytorch=2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0
pip install goldenretriever-core
```
# Usage
Golden Retriever is built on top of PyTorch Lightning and Hydra. To run an experiment, you need to create a configuration file and pass
it to the `golden-retriever` command. A few examples are provided in the `conf` folder.
## Training
Here is a simple example of how to train a DPR-like retriever on the NQ dataset.
First, download the dataset from [DPR](https://github.com/facebookresearch/DPR?tab=readme-ov-file#retriever-input-data-format). Then run the following command:
```bash
golden-retriever train conf/nq-dpr.yaml
```
## Evaluation
```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset
retriever = GoldenRetriever(
    question_encoder="",
    document_index="",
    device="cuda",
    precision="16",
)

test_dataset = InBatchNegativesDataset(
    name="test",
    path="",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    test_dataset=test_dataset,
    log_to_wandb=False,
    top_k=[20, 100],
)
trainer.test()
```
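The `top_k=[20, 100]` values above control the recall cutoffs reported during evaluation. As a rough illustration of what recall@k measures (this is a standalone sketch, not the library's internal implementation; the function name and toy ids are hypothetical):

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for gold in gold_ids if gold in top_k)
    return hits / len(gold_ids)

# Toy ranking: 3 gold passages, one of them missing from the retrieved list.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
gold = ["d2", "d4", "d8"]
print(recall_at_k(retrieved, gold, 5))  # 2 of the 3 gold passages are in the top 5
```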
### Distributed environment
Golden Retriever supports distributed training. For the moment, it is only possible to train on a single node with multiple GPUs and without model sharding, i.e.
only DDP and FSDP with the `NO_SHARD` strategy are supported.
To run distributed training, add the following keys to the configuration file:
```yaml
devices: 4 # number of GPUs
# strategy: "ddp_find_unused_parameters_true" # DDP
# FSDP with NO_SHARD
strategy:
  _target_: lightning.pytorch.strategies.FSDPStrategy
  sharding_strategy: "NO_SHARD"
```
## Inference
```python
from goldenretriever import GoldenRetriever
retriever = GoldenRetriever(
    question_encoder="path/to/question/encoder",
    passage_encoder="path/to/passage/encoder",
    document_index="path/to/document/index",
)
# retrieve documents
retriever.retrieve("What is the capital of France?", k=5)
```
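Conceptually, `retrieve` scores passages by the similarity between the question embedding and each passage embedding. A minimal self-contained sketch of that scoring step, using made-up 3-dimensional vectors in place of real encoder outputs (function name and data are purely illustrative):

```python
def top_k_by_dot_product(query_vec, passage_vecs, k):
    """Rank passages by dot-product similarity with the query, highest first."""
    scores = [
        (pid, sum(q * p for q, p in zip(query_vec, vec)))
        for pid, vec in passage_vecs.items()
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Made-up embeddings standing in for encoder outputs.
passages = {
    "paris": [0.9, 0.1, 0.0],
    "rome": [0.1, 0.8, 0.1],
    "berlin": [0.2, 0.1, 0.9],
}
query = [1.0, 0.0, 0.1]  # pretend embedding of "What is the capital of France?"
print(top_k_by_dot_product(query, passages, k=2))
```

In practice the library encodes questions and passages with the configured encoders and searches a (possibly FAISS-backed) document index instead of a plain dictionary.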
## Data format
### Input data
The retriever expects a jsonl file similar to [DPR](https://github.com/facebookresearch/DPR):
```json lines
[
  {
    "question": "....",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [{
      "title": "...",
      "text": "...."
    }],
    "negative_ctxs": ["..."],
    "hard_negative_ctxs": ["..."]
  },
  ...
]
```
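To generate training data programmatically, each line just needs to be a JSON object with the keys above. A minimal sketch using only the standard library (the example question and contexts are made up):

```python
import io
import json

REQUIRED_KEYS = {"question", "answers", "positive_ctxs", "negative_ctxs", "hard_negative_ctxs"}

record = {
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [{"title": "Paris", "text": "Paris is the capital of France."}],
    "negative_ctxs": [],
    "hard_negative_ctxs": [],
}

# Write one JSON object per line (jsonl); a real script would use a file instead.
buffer = io.StringIO()
buffer.write(json.dumps(record) + "\n")

# Round-trip check: each line parses back with the expected keys.
for line in buffer.getvalue().splitlines():
    parsed = json.loads(line)
    assert REQUIRED_KEYS <= parsed.keys()
```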
### Index data
The document to index can be either a jsonl file or a tsv file similar to
[DPR](https://github.com/facebookresearch/DPR):
- `jsonl`: each line is a JSON object with the following keys: `id`, `text`, `metadata`
- `tsv`: each line is a tab-separated string with the `id` and `text` columns,
followed by any other columns, which will be stored in the `metadata` field
jsonl example:
```json lines
[
  {
    "id": "...",
    "text": "...",
    "metadata": ["{...}"]
  },
  ...
]
```
tsv example:
```tsv
id \t text \t any other column
...
```
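Since both formats carry the same information, a tsv file can be mapped to the jsonl shape with a few lines of standard-library code. This is a sketch of that mapping, not the library's loader; the positional `col_0`, `col_1`, ... keys used for the extra columns are an assumption for illustration:

```python
import csv
import io
import json

# A single made-up tsv row: id, text, then two extra columns.
tsv_data = "doc1\tParis is the capital of France.\ten\twiki\n"

records = []
for row in csv.reader(io.StringIO(tsv_data), delimiter="\t"):
    doc_id, text, *extra = row
    # Extra columns go into the metadata field, keyed positionally (hypothetical naming).
    records.append({
        "id": doc_id,
        "text": text,
        "metadata": {f"col_{i}": value for i, value in enumerate(extra)},
    })

print(json.dumps(records[0]))
```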