# Lightning IR
<p align="center">
<img src="./docs/_static/lightning-ir-logo.svg" alt="lightning ir logo" width="10%">
<p align="center">Your one-stop shop for fine-tuning and running neural ranking models.</p>
</p>
-----------------
Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) and provides a simple and flexible interface for working with neural ranking models.
Want to:
- fine-tune your own cross- or bi-encoder models?
- index and search through a collection of documents with ColBERT or SPLADE?
- re-rank documents with state-of-the-art models?
Lightning IR has you covered!
## Installation
Lightning IR can be installed using pip:
```
pip install lightning-ir
```
## Getting Started
See the [Quickstart](https://webis-de.github.io/lightning-ir/quickstart.html) guide for an introduction to Lightning IR. The [Documentation](https://webis-de.github.io/lightning-ir/) provides a detailed overview of the library's functionality.
The easiest way to use Lightning IR is via the CLI. It builds on the [PyTorch Lightning CLI](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html#lightning-cli) and adds options to provide a unified interface for fine-tuning and running neural ranking models.
The behavior of the CLI can be customized using yaml configuration files. See the [configs](configs) directory for several example configuration files. For example, the following command can be used to re-rank the official TREC DL 19/20 re-ranking set with a pre-finetuned cross-encoder model. It will automatically download the model and data, run the re-ranking, write the results to a TREC-style run file, and report the nDCG@10 score.
```bash
lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml
```
For more details, see the [Usage](#usage) section.
## Usage
### Command Line Interface
The CLI offers four subcommands:
```
$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.
```
Configuration files need to be provided to specify model, data, and fine-tuning/inference parameters. See the [configs](configs) directory for examples. There are four types of configuration sections (a combined skeleton is sketched after this list):
- `trainer`: Specifies the fine-tuning/inference parameters and callbacks.
- `model`: Specifies the model to use and its parameters.
- `data`: Specifies the dataset(s) to use and their parameters.
- `optimizer`: Specifies the optimizer parameters (only needed for fine-tuning).
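All four sections can live in a single file (as in the fine-tuning example below) or be split across several files passed via repeated `--config` flags. The following skeleton only illustrates the overall layout; the class paths and values are placeholders taken from the examples in the next section:

```yaml
trainer:          # fine-tuning/inference parameters and callbacks
  max_steps: 100000
model:            # the module to fine-tune or run
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
data:             # the dataset(s) to process
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
optimizer:        # only needed for fine-tuning
  class_path: AdamW
  init_args:
    lr: 1e-5
```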
### Example
The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.
#### Fine-tuning
To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:
<details>
<summary>bi-encoder-fit.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5
```
</details>
```bash
lightning-ir fit --config bi-encoder-fit.yaml
```
The fine-tuned model is saved in the directory `lightning_logs/version_X/huggingface_checkpoint/`.
#### Indexing
We now assume the model from the previous fine-tuning step was moved to the directory `models/bi-encoder`. To index the MS MARCO passage collection with [faiss](https://github.com/facebookresearch/faiss) using the fine-tuned model, use the following configuration file and command:
<details>
<summary>bi-encoder-index.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage
```
</details>
```bash
lightning-ir index --config bi-encoder-index.yaml
```
The index is saved in the directory `models/bi-encoder/indexes/msmarco-passage`.
#### Searching
To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:
<details>
<summary>bi-encoder-search.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged
```
</details>
```bash
lightning-ir search --config bi-encoder-search.yaml
```
The run files are saved as `models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run`. Additionally, the nDCG@10 scores are printed to the console.
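The run files follow the standard TREC run format: one whitespace-separated line per query-document pair, listing the query id, the literal `Q0`, the document id, the rank, the score, and a run tag. The line below is only a schematic illustration with placeholder fields, not actual Lightning IR output:

```
<query_id> Q0 <doc_id> <rank> <score> <run_tag>
```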
#### Re-ranking
Assuming we've also fine-tuned a cross-encoder that is saved in the directory `models/cross-encoder`, we can re-rank the retrieved documents using the following configuration file and command:
<details>
<summary>cross-encoder-re-rank.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
```
</details>
```bash
lightning-ir re_rank --config cross-encoder-re-rank.yaml
```
The run files are saved as `models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run`. Additionally, the nDCG@10 scores are printed to the console.