Name: PLAID-X
Version: 0.3.1
Home page: https://github.com/hltcoe/ColBERT-X/tree/plaid-x
Summary: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT and XLM-RoBERTa
Author: Eugene Yang
Requires Python: >=3.8
Uploaded: 2024-02-27 23:25:57

# PLAID-X

This is a generalized version of [PLAID](https://github.com/stanford-futuredata/ColBERT) and the previous ColBERT-X for CLIR.
The codebase supports models trained with the original ColBERT-X scripts, which are not compatible with the PLAID codebase released by the Stanford Futuredata group.

## Installation

PLAID-X is available on PyPI. You can install it with
```bash
pip install PLAID-X
```

Make sure your `gcc` and `g++` versions are `>=9.4.0`, which `ninja` requires to work properly.
We recommend using a `conda` environment to manage them.
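
For example, one possible setup (a hedged sketch: the conda-forge compiler package names and version pins are assumptions and may need adjusting for your platform):

```bash
# Create an isolated environment with recent compilers from conda-forge
# (package names here are assumptions; adjust for your platform).
conda create -n plaidx -c conda-forge python=3.10 "gcc>=9.4" "gxx>=9.4"
conda activate plaidx

# Confirm the compiler versions that ninja will pick up
gcc --version && g++ --version

pip install PLAID-X
```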

## Usage

We have published a [tutorial](https://github.com/hltcoe/clir-tutorial) on CLIR with notebooks to run various models. 
Please refer to the [PLAID-X notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb) there for a simple working example in Python. 

The following sections provide a series of CLI commands for running at a larger scale. 

### Training Models

The following command starts training with the `t53b-monot5-msmarco-engeng.jsonl.gz` triples file from the Huggingface dataset repository [`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores), using English queries and translated Chinese passages from [neuMARCO](https://ir-datasets.com/neumarco.html).

```bash
python -m colbert.scripts.train \
--model_name xlm-roberta-large \
--training_triples hltcoe/tdist-msmarco-scores:t53b-monot5-msmarco-engeng.jsonl.gz \
--training_irds_id neumarco/zh/train \
--maxsteps 200000 \
--learning_rate 5e-6 \
--kd_loss KLD \
--only_top \
--per_device_batch_size 8 \
--nway 6 \
--run_tag test \
--experiment test
```
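
For intuition, `--kd_loss KLD` distills the teacher's scores (here from the monoT5 triples) into the student by matching score distributions over the `--nway` candidate passages per query. A hedged sketch of the objective, in my own notation rather than anything taken from the codebase:

```math
p^{M}_{i} = \frac{\exp\left(s^{M}(q, d_i)\right)}{\sum_{j=1}^{n} \exp\left(s^{M}(q, d_j)\right)},
\qquad
\mathcal{L}_{\mathrm{KLD}} = \sum_{i=1}^{n} p^{T}_{i} \, \log \frac{p^{T}_{i}}{p^{S}_{i}}
```

where `s^M(q, d_i)` is model `M`'s score for candidate passage `d_i` (teacher `T` or student `S`) and `n` is the `--nway` value.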

### Indexing 

Since PLAID-X is a passage retrieval engine, you need to create a passage collection if you intend to search a document collection.
The following command creates a passage collection for the NeuCLIR1 Chinese corpus (the file is downloaded implicitly from Huggingface). 

```bash
python -m colbert.scripts.collection_utils create_passage_collection \
--root ./test_coll/ --corpus neuclir/neuclir1:data/zho-00000-of-00001.jsonl.gz
```
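
The command populates `--root` with the passage files and a passage-to-document mapping; `mapping.tsv` is consumed by the search step below. A quick, hedged way to inspect the result (exact file names other than `mapping.tsv` may vary):

```bash
# List the generated passage collection; mapping.tsv maps passages back to
# their source documents and is used by the search command later on.
ls -lh ./test_coll/
```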

The indexing process is broken into **three** steps. 
This is a change from the previous version, which used two steps, and also differs from the original Stanford codebase, which combines everything into a single Python call.
Separating the steps allows computational resources to be allocated more precisely and avoids GPU reservation deadlocks between PyTorch and FAISS (a scheduling sketch follows the command below).

```bash
for step in prepare encode finalize; do
python -m colbert.scripts.index \
--coll_dir ./test_coll \
--index_name test_index \
--dataset_name test_coll \
--nbits 1 \
--step $step \
--checkpoint eugene-yang/plaidx-xlmr-large-mlir-neuclir \
--experiment test 
done
```
Note that the `--checkpoint` flag accepts both ColBERT-X and ColBERT models stored on Huggingface Models.
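
Because the steps are separate invocations, they can be scheduled on different hardware. A hedged sketch (the flags mirror the loop above; `CUDA_VISIBLE_DEVICES` is the standard PyTorch device-selection variable, and which steps actually need a GPU depends on your setup):

```bash
# Run only the GPU-heavy encoding step on a designated GPU; the prepare and
# finalize steps can be launched separately with --step prepare / finalize.
CUDA_VISIBLE_DEVICES=0 python -m colbert.scripts.index \
--coll_dir ./test_coll \
--index_name test_index \
--dataset_name test_coll \
--nbits 1 \
--step encode \
--checkpoint eugene-yang/plaidx-xlmr-large-mlir-neuclir \
--experiment test
```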

### Searching 

Finally, the following command searches the collection with a query `.tsv` file, where the first column is the query id and the second column contains the query text. 

```bash
python -m colbert.scripts.search \
--index_name neuclir-zho.1bits \
--passage_mapping ./test_coll/mapping.tsv \
--query_file query.tsv \
--metrics nDCG@20 MAP R@100 R@1000 Judged@10 \
--qrel qrels.txt \
--experiment test
```
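
For reference, the query file is a plain tab-separated `.tsv`; a hedged illustration with made-up queries:

```bash
# Build a toy query file: query id <tab> query text (queries are made up).
printf '1\teconomic impact of the Belt and Road Initiative\n'  > query.tsv
printf '2\trenewable energy policies in Chinese cities\n'     >> query.tsv
```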

## Citation and Credit

Please cite the following paper if you use the CLIR generalization of ColBERT.
```bibtex
@inproceedings{ecir2022colbert-x,
	author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
	title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022},
	url = {https://arxiv.org/abs/2201.08471}
}
```

Please cite the following paper if you use the **MLIR** generalization. 
```bibtex
@inproceedings{ecir2023mlir,
	title = {Neural Approaches to Multilingual Information Retrieval},
	author = {Dawn Lawrie and Eugene Yang and Douglas W. Oard and James Mayfield},
	booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)},
	year = {2023},
	url = {https://arxiv.org/abs/2209.01335}
}
```

Please cite the following paper if you use the updated PLAID-X implementation or the translate-distill capability of the codebase. 
```bibtex
@inproceedings{ecir2024translate-distill,
  author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
  title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
  booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
  year = {2024},
  url = {https://arxiv.org/abs/2401.04810}
}
```

            
