goldenretriever-core

Name: goldenretriever-core
Version: 0.9.0
Home page: https://github.com/Riccorl/golden-retriever
Summary: Dense Retriever
Upload time: 2024-02-07 18:19:11
Docs URL: None
Author: Riccardo Orlando
Requires Python: >=3.10
License: Apache
Keywords: nlp, deep learning, transformer, pytorch, retriever, rag, dpr
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
            <h1 align="center">
  🦮 Golden Retriever
</h1>

<p align="center">
  <a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-orange?logo=pytorch"></a>
  <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-blueviolet"></a>
  <a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg"></a>
  <a href="https://github.dev/Riccorl/golden-retriever"><img alt="vscode" src="https://img.shields.io/badge/preview%20in-vscode.dev-blue"></a>
</p>
<p align="center">
  <a href="https://github.com/Riccorl/golden-retriever/releases"><img alt="release" src="https://img.shields.io/github/v/release/Riccorl/golden-retriever"></a>
  <a href="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml"><img alt="gh-status" src="https://github.com/Riccorl/golden-retriever/actions/workflows/python-publish-pypi.yml/badge.svg"></a>

</p>

# How to use

Install the library from [PyPI](https://pypi.org/project/goldenretriever-core/):

```bash
pip install goldenretriever-core
```

or from source:

```bash
git clone https://github.com/Riccorl/golden-retriever.git
cd golden-retriever
pip install -e .
```

# Usage

## How to run an experiment

### Training

Here is a simple example of how to train a DPR-like retriever on the WebQuestions dataset.
First download the dataset from [DPR](https://github.com/facebookresearch/DPR), then run the following code:

```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset

# create a retriever
retriever = GoldenRetriever(
    question_encoder="intfloat/e5-small-v2",
    passage_encoder="intfloat/e5-small-v2"
)

# create a dataset
train_dataset = InBatchNegativesDataset(
    name="webq_train",
    path="path/to/webq_train.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
    shuffle=True,
)
val_dataset = InBatchNegativesDataset(
    name="webq_dev",
    path="path/to/webq_dev.json",
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

# create the trainer
trainer = Trainer(
    retriever=retriever,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    max_steps=25_000,
    wandb_online_mode=True,
    wandb_project_name="golden-retriever-dpr",
    wandb_experiment_name="e5-small-webq",
    max_hard_negatives_to_mine=5,
)

# start training
trainer.train()
```

### Evaluation

```python
from goldenretriever.trainer import Trainer
from goldenretriever import GoldenRetriever
from goldenretriever.data.datasets import InBatchNegativesDataset

retriever = GoldenRetriever(
    question_encoder="",  # path or name of the trained question encoder
    document_index="",  # path to the document index
    device="cuda",
    precision="16",
)

test_dataset = InBatchNegativesDataset(
    name="test",
    path="",  # path to the test dataset file
    tokenizer=retriever.question_tokenizer,
    question_batch_size=64,
    passage_batch_size=400,
    max_passage_length=64,
)

trainer = Trainer(
    retriever=retriever,
    test_dataset=test_dataset,
    log_to_wandb=False,
    top_k=[20, 100],
)

trainer.test()
```

## Inference

```python
from goldenretriever import GoldenRetriever

retriever = GoldenRetriever(
    question_encoder="path/to/question/encoder",
    passage_encoder="path/to/passage/encoder",
    document_index="path/to/document/index"
)

# retrieve documents
retriever.retrieve("What is the capital of France?", k=5)
```

## Data format

### Input data

The retriever expects a jsonl file similar to [DPR](https://github.com/facebookresearch/DPR):

```json lines
[
  {
    "question": "....",
    "answers": ["...", "...", "..."],
    "positive_ctxs": [{
      "title": "...",
      "text": "...."
    }],
    "negative_ctxs": ["..."],
    "hard_negative_ctxs": ["..."]
  },
  ...
]
```
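
As a quick illustration, here is a minimal sketch, using only the Python standard library, that builds one record in the schema shown above and serializes it to a file. The field values and the output path are placeholders, not part of the Golden Retriever API.

```python
import json

# Minimal sketch (standard library only): build one DPR-style training record
# following the schema shown above. All values and the output path are
# placeholders for illustration.
record = {
    "question": "What is the capital of France?",
    "answers": ["Paris"],
    "positive_ctxs": [
        {"title": "Paris", "text": "Paris is the capital and largest city of France."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [],
}

# The example above shows a list of such records, so we serialize a list.
with open("webq_train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```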

### Index data

The documents to index can be provided either as a jsonl file or as a tsv file similar to
[DPR](https://github.com/facebookresearch/DPR):

- `jsonl`: each line is a JSON object with the following keys: `id`, `text`, `metadata`
- `tsv`: each line is a tab-separated string with the `id` and `text` columns,
  followed by any other columns, which will be stored in the `metadata` field
  (a conversion sketch follows the examples below)

jsonl example:

```json lines
[
  {
    "id": "...",
    "text": "...",
    "metadata": ["{...}"]
  },
  ...
]
```

tsv example:

```tsv
id \t text \t any other column
...
```
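
Below is a minimal conversion sketch, using only the Python standard library, that turns a tsv index file into the jsonl format described above. The file names are placeholders, and the way extra columns are packed into `metadata` follows the description in the list above rather than any specific Golden Retriever API.

```python
import csv
import json

# Minimal sketch (standard library only): convert a DPR-style tsv index file
# into the jsonl index format described above. The first two columns are `id`
# and `text`; any remaining columns are collected into the `metadata` field.
# File names are placeholders for illustration.
with open("documents.tsv", encoding="utf-8") as tsv_file, open(
    "documents.jsonl", "w", encoding="utf-8"
) as jsonl_file:
    reader = csv.reader(tsv_file, delimiter="\t")
    for row in reader:
        doc_id, text, *extra = row
        record = {"id": doc_id, "text": text, "metadata": extra}
        jsonl_file.write(json.dumps(record, ensure_ascii=False) + "\n")
```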


            
