<h1 align="center">
🦮 Golden Retriever
</h1>
<p align="center">
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-orange?logo=pytorch"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-blueviolet"></a>
<a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg"></a>
<a href="https://github.dev/Riccorl/golden-retriever"><img alt="vscode" src="https://img.shields.io/badge/preview%20in-vscode.dev-blue"></a>
</p>
<p align="center">
<a href="https://github.com/Riccorl/golden-retriever/releases"><img alt="release" src="https://img.shields.io/github/v/release/Riccorl/golden-retriever"></a>
<a href="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml"><img alt="gh-status" src="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml/badge.svg"></a>
</p>
# How to use
Install the library from [PyPI](https://pypi.org/project/goldenretriever-core/):
```bash
pip install goldenretriever
```
or from source:
```bash
git clone https://github.com/Riccorl/golden-retriever.git
cd golden-retriever
pip install -e .
```
# Usage
## How to run an experiment
### Training
Here is a simple example of how to train a DPR-like retriever on the WebQuestions dataset.
First, download the dataset from [DPR](https://github.com/facebookresearch/DPR), then run the following code:
```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset
# create a retriever
retriever = GoldenRetriever(
    question_encoder="intfloat/e5-small-v2",
    passage_encoder="intfloat/e5-small-v2",
)

# create a dataset
train_dataset = InBatchNegativesDataset(
    name="webq_train",
    path="path/to/webq_train.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
    shuffle=True,
)
val_dataset = InBatchNegativesDataset(
    name="webq_dev",
    path="path/to/webq_dev.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    max_steps=25_000,
    wandb_online_mode=True,
    wandb_project_name="golden-retriever-dpr",
    wandb_experiment_name="e5-small-webq",
    max_hard_negatives_to_mine=5,
)
# start training
trainer.train()
```
### Evaluation
```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset
retriever = GoldenRetriever(
    question_encoder="",
    document_index="",
    device="cuda",
    precision="16",
)

test_dataset = InBatchNegativesDataset(
    name="test",
    path="",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    test_dataset=test_dataset,
    log_to_wandb=False,
    top_k=[20, 100],
)
trainer.test()
```
## Inference
```python
from goldenretriever import GoldenRetriever
retriever = GoldenRetriever(
    question_encoder="path/to/question/encoder",
    passage_encoder="path/to/passage/encoder",
    document_index="path/to/document/index",
)
# retrieve documents
retriever.retrieve("What is the capital of France?", k=5)
```
## Data format
### Input data
The retriever expects a jsonl file similar to [DPR](https://github.com/facebookresearch/DPR):
```json lines
[
    {
        "question": "....",
        "answers": ["...", "...", "..."],
        "positive_ctxs": [{
            "title": "...",
            "text": "...."
        }],
        "negative_ctxs": ["..."],
        "hard_negative_ctxs": ["..."]
    },
    ...
]
```
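As a sketch, a record in this format can be built and written from Python. The field names follow the example above; the field values and the file name are illustrative:

```python
import json

# One training example in the DPR-style format described above.
example = {
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [{"title": "Paris", "text": "Paris is the capital of France."}],
    "negative_ctxs": ["Lyon is a city in France."],
    "hard_negative_ctxs": ["Marseille is the second-largest city in France."],
}

# Write a list of examples as a JSON array (one file per split).
with open("webq_train.json", "w", encoding="utf-8") as f:
    json.dump([example], f, ensure_ascii=False, indent=2)
```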
### Index data
The documents to index can be provided either as a jsonl file or as a tsv file, similar to
[DPR](https://github.com/facebookresearch/DPR):
- `jsonl`: each line is a json object with the following keys: `id`, `text`, `metadata`
- `tsv`: each line is a tab-separated string with the `id` and `text` column,
followed by any other column that will be stored in the `metadata` field
jsonl example:
```json lines
[
    {
        "id": "...",
        "text": "...",
        "metadata": ["{...}"]
    },
    ...
]
```
tsv example:
```tsv
id \t text \t any other column
...
```
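A tsv index in this layout can be converted to the jsonl form with a short script. This is a sketch: it assumes the tsv has a header row whose first two columns are `id` and `text`, and it folds any remaining columns into `metadata`, as described above.

```python
import csv
import json

def tsv_to_jsonl(tsv_path: str, jsonl_path: str) -> None:
    """Convert a DPR-style tsv index (id, text, extra columns) to jsonl."""
    with open(tsv_path, newline="", encoding="utf-8") as fin, \
         open(jsonl_path, "w", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        header = next(reader)  # e.g. ["id", "text", "title"]
        for row in reader:
            record = dict(zip(header, row))
            doc = {
                "id": record.pop("id"),
                "text": record.pop("text"),
                "metadata": record,  # any remaining columns
            }
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")
```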