# Lightning IR
<p align="center">
<img src="./docs/_static/lightning-ir-logo.svg" alt="lightning ir logo" width="10%">
<p align="center">Your one-stop shop for fine-tuning and running neural ranking models.</p>
</p>
-----------------
Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) and provides a simple and flexible interface for working with neural ranking models.
Want to:
- fine-tune your own cross- or bi-encoder models?
- index and search through a collection of documents with ColBERT or SPLADE?
- re-rank documents with state-of-the-art models?
Lightning IR has you covered!
## Installation
Lightning IR can be installed using pip:
```
pip install lightning-ir
```
## Getting Started
See the [Quickstart](https://webis-de.github.io/lightning-ir/quickstart.html) guide for an introduction to Lightning IR. The [Documentation](https://webis-de.github.io/lightning-ir/) provides a detailed overview of the library's functionality.
The easiest way to use Lightning IR is via the CLI. It builds on the [PyTorch Lightning CLI](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli.html#lightning-cli) and adds options to provide a unified interface for fine-tuning and running neural ranking models.
The behavior of the CLI can be customized using yaml configuration files. See the [configs](configs) directory for several example configuration files. For example, the following command can be used to re-rank the official TREC DL 19/20 re-ranking set with a pre-finetuned cross-encoder model. It will automatically download the model and data, run the re-ranking, write the results to a TREC-style run file, and report the nDCG@10 score.
```bash
lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml
```
For more details, see the [Usage](#usage) section.
## Usage
### Command Line Interface
The CLI offers four subcommands:
```
$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.
```
Configuration files need to be provided to specify model, data, and fine-tuning/inference parameters. See the [configs](configs) directory for examples. There are four types of configuration sections (a combined skeleton is sketched after this list):
- `trainer`: Specifies the fine-tuning/inference parameters and callbacks.
- `model`: Specifies the model to use and its parameters.
- `data`: Specifies the dataset(s) to use and their parameters.
- `optimizer`: Specifies the optimizer parameters (only needed for fine-tuning).
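All four sections can live in a single file (as in the fine-tuning example below) or be split across several files passed via repeated `--config` flags. The following skeleton only illustrates the overall layout; the class paths and values are placeholders taken from the examples in the next section:

```yaml
trainer:          # fine-tuning/inference parameters and callbacks
  max_steps: 100000
model:            # the module to fine-tune or run
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
data:             # the dataset(s) to process
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
optimizer:        # only needed for fine-tuning
  class_path: AdamW
  init_args:
    lr: 1e-5
```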
### Example
The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.
#### Fine-tuning
To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:
<details>
<summary>bi-encoder-fit.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5
```
</details>
```bash
lightning-ir fit --config bi-encoder-fit.yaml
```
The fine-tuned model is saved in the directory `lightning_logs/version_X/huggingface_checkpoint/`.
#### Indexing
We now assume the model from the previous fine-tuning step was moved to the directory `models/bi-encoder`. To index the MS MARCO passage collection with [faiss](https://github.com/facebookresearch/faiss) using the fine-tuned model, use the following configuration file and command:
<details>
<summary>bi-encoder-index.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage
```
</details>
```bash
lightning-ir index --config bi-encoder-index.yaml
```
The index is saved in the directory `models/bi-encoder/indexes/msmarco-passage`.
#### Searching
To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:
<details>
<summary>bi-encoder-search.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged
```
</details>
```bash
lightning-ir search --config bi-encoder-search.yaml
```
The run files are saved as `models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run`. Additionally, the nDCG@10 scores are printed to the console.
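The run files follow the standard TREC run format: one whitespace-separated line per query-document pair, listing the query id, the literal `Q0`, the document id, the rank, the score, and a run tag. The line below is only a schematic illustration with placeholder fields, not actual Lightning IR output:

```
<query_id> Q0 <doc_id> <rank> <score> <run_tag>
```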
#### Re-ranking
Assuming we've also fine-tuned a cross-encoder that is saved in the directory `models/cross-encoder`, we can re-rank the retrieved documents using the following configuration file and command:
<details>
<summary>cross-encoder-re-rank.yaml</summary>
```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
```
</details>
```bash
lightning-ir re_rank --config cross-encoder-re-rank.yaml
```
The run files are saved as `models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run`. Additionally, the nDCG@10 scores are printed to the console.