ir-datasets

- Name: ir-datasets
- Version: 0.5.9
- Home page: https://github.com/allenai/ir_datasets
- Summary: provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
- Author: Sean MacAvaney
- Requires Python: >=3.8
- Upload time: 2024-11-08 11:01:05

# ir_datasets

`ir_datasets` is a Python package that provides a common interface to many IR ad-hoc ranking
benchmarks, training datasets, etc.

The package takes care of downloading datasets (including documents, queries, relevance judgments,
etc.) when available from public sources. Instructions on how to obtain datasets are provided when
they are not publicly available.

`ir_datasets` provides a common iterator format so that datasets can easily be used in Python. It
attempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while
handling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to
allow quick lookups of documents by ID.
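
For example, each record is exposed as a namedtuple (such as the `GenericDoc` shown in the examples below), so fields can be accessed by name. A minimal sketch:

```python
import ir_datasets

dataset = ir_datasets.load('msmarco-passage/train')
# Records are namedtuples, so fields are accessible by name.
for doc in dataset.docs_iter():
    print(doc.doc_id, doc.text[:50])
    break  # just peek at the first document
```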

A command line interface is also available.

You can find a list of datasets and their features [here](https://ir-datasets.com/).
Want a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request! 

## Getting Started

For a quick start, check out our Colab tutorials:
[Python](https://colab.research.google.com/github/allenai/ir_datasets/blob/master/examples/ir_datasets.ipynb)
[Command Line](https://colab.research.google.com/github/allenai/ir_datasets/blob/master/examples/ir_datasets_cli.ipynb)

Install via pip:

```
pip install ir_datasets
```

If you want the main branch, install it as follows:

```
pip install git+https://github.com/allenai/ir_datasets.git
```

If you want to build from source, use:

```
$ git clone https://github.com/allenai/ir_datasets
$ cd ir_datasets
$ python setup.py bdist_wheel
$ pip install dist/ir_datasets-*.whl
```

Tested with Python versions 3.7, 3.8, 3.9, and 3.10. (Note that the current release declares `requires_python>=3.8`, so the minimum supported Python version is 3.8.)

## Features

**Python and Command Line Interfaces**. Access datasets both through a simple Python API and
via the command line.

```python
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
# Documents
for doc in dataset.docs_iter():
    print(doc)
# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...
# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...
# ...
```

```bash
ir_datasets export msmarco-passage/train docs | head -n2
0 The presence of communication amid scientific minds was equally important to the success of the Manh...
1 The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...
```

**Automatically downloads source files** (when available). Will download and verify the source
files for queries, documents, qrels, etc. when they are publicly available, as they are needed.
A CI build checks weekly to ensure that all the downloadable content is available and correct:
[![Downloadable Content](https://github.com/seanmacavaney/ir-datasets.com/actions/workflows/verify_downloads.yml/badge.svg)](https://github.com/seanmacavaney/ir-datasets.com/actions/workflows/verify_downloads.yml).
We mirror some troublesome files on [mirror.ir-datasets.com](https://mirror.ir-datasets.com/), and
automatically switch to the mirror when the original source is not available.

```python
import ir_datasets
dataset = ir_datasets.load('msmarco-passage/train')
for doc in dataset.docs_iter(): # Will download and extract MS-MARCO's collection.tar.gz the first time
    ...
for query in dataset.queries_iter(): # Will download and extract MS-MARCO's queries.tar.gz the first time
    ...
```

**Instructions for dataset access** (when not publicly available). Provides instructions on how
to get a copy of the data when it is not publicly available online (e.g., when it requires a
data usage agreement).

```python
import ir_datasets
dataset = ir_datasets.load('trec-arabic')
for doc in dataset.docs_iter():
    ...
# Provides the following instructions:
# The dataset is based on the Arabic Newswire corpus. It is available from the LDC via: <https://catalog.ldc.upenn.edu/LDC2001T55>
# To proceed, symlink the source file here: [gives path]
```

**Support for datasets big and small**. By using iterators, supports large datasets that may
not fit into system memory, such as ClueWeb.

```python
import ir_datasets
dataset = ir_datasets.load('clueweb09')
for doc in dataset.docs_iter():
    ... # will iterate through all ~1B documents
```

**Fixes known dataset issues**. For instance, automatically corrects the document UTF-8 encoding
problem in the MS-MARCO passage collection.

```python
import ir_datasets
dataset = ir_datasets.load('msmarco-passage')
docstore = dataset.docs_store()
docstore.get('243').text
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/ˈkeɪnz/ KAYNZ; 5 June 1883 – 21 April [SNIP]"
# Naïve UTF-8 decoding yields double-encoding artifacts like:
# "John Maynard Keynes, 1st Baron Keynes, CB, FBA (/Ë\x88keɪnz/ KAYNZ; 5 June 1883 â\x80\x93 21 April [SNIP]"
#                                                  ~~~~~~  ~~                       ~~~~~~~~~
```

**Fast Random Document Access.** Builds data structures that allow fast and efficient lookup of
document content. For large datasets, such as ClueWeb, uses
[checkpoint files](https://ir-datasets.com/clueweb_warc_checkpoints.md) to load documents from
source 40x faster than normal. Results are cached for even faster subsequent accesses.

```python
import ir_datasets
dataset = ir_datasets.load('clueweb12')
docstore = dataset.docs_store()
docstore.get_many(['clueweb12-0000tw-05-00014', 'clueweb12-0000tw-05-12119', 'clueweb12-0106wb-18-19516'])
# {'clueweb12-0000tw-05-00014': ..., 'clueweb12-0000tw-05-12119': ..., 'clueweb12-0106wb-18-19516': ...}
```

**Fancy Iter Slicing.** Sometimes it's helpful to be able to select ranges of data (e.g., for processing
document collections in parallel on multiple devices). Efficient implementations of slicing operations
allow for much faster dataset partitioning than using `itertools.islice`.

```python
import ir_datasets
dataset = ir_datasets.load('clueweb12')
dataset.docs_iter()[500:1000] # normal slicing behavior
# WarcDoc(doc_id='clueweb12-0000tw-00-00502', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00503', ...), ...
dataset.docs_iter()[-10:-8] # includes negative indexing
# WarcDoc(doc_id='clueweb12-1914wb-28-24245', ...), WarcDoc(doc_id='clueweb12-1914wb-28-24246', ...)
dataset.docs_iter()[::100] # supports a step value (positive steps only)
# WarcDoc(doc_id='clueweb12-0000tw-00-00000', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00100', ...), ...
dataset.docs_iter()[1/3:2/3] # supports proportional slicing (this takes the middle third of the collection)
# WarcDoc(doc_id='clueweb12-0605wb-28-12714', ...), WarcDoc(doc_id='clueweb12-0605wb-28-12715', ...), ...
```
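
Proportional slicing also makes it straightforward to split a collection evenly across workers. A minimal sketch, where the worker count and processing loop are illustrative and the fractional bounds rely on the proportional-slicing behavior shown above:

```python
import ir_datasets

dataset = ir_datasets.load('clueweb12')
num_workers = 4  # illustrative
worker_id = 0    # e.g., taken from a job scheduler or process index
# Each worker processes its own contiguous fraction of the collection.
for doc in dataset.docs_iter()[worker_id/num_workers:(worker_id+1)/num_workers]:
    ...  # process this worker's share of the documents
```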

## Datasets

Available datasets include:
 - [ANTIQUE](https://ir-datasets.com/antique.html)
 - [AQUAINT](https://ir-datasets.com/aquaint.html)
 - [BEIR (benchmark suite)](https://ir-datasets.com/beir.html)
 - [TREC CAR](https://ir-datasets.com/car.html)
 - [C4](https://ir-datasets.com/c4.html)
 - [ClueWeb09](https://ir-datasets.com/clueweb09.html)
 - [ClueWeb12](https://ir-datasets.com/clueweb12.html)
 - [CLIRMatrix](https://ir-datasets.com/clirmatrix.html)
 - [CodeSearchNet](https://ir-datasets.com/codesearchnet.html)
 - [CORD-19](https://ir-datasets.com/cord19.html)
 - [DPR Wiki100](https://ir-datasets.com/dpr-w100.html)
 - [GOV](https://ir-datasets.com/gov.html)
 - [GOV2](https://ir-datasets.com/gov2.html)
 - [HC4](https://ir-datasets.com/hc4.html)
 - [Highwire (TREC Genomics 2006-07)](https://ir-datasets.com/highwire.html)
 - [Medline](https://ir-datasets.com/medline.html)
 - [MSMARCO (document)](https://ir-datasets.com/msmarco-document.html)
 - [MSMARCO (passage)](https://ir-datasets.com/msmarco-passage.html)
 - [MSMARCO (QnA)](https://ir-datasets.com/msmarco-qna.html)
 - [Natural Questions](https://ir-datasets.com/natural-questions.html)
 - [NFCorpus (NutritionFacts)](https://ir-datasets.com/nfcorpus.html)
 - [NYT](https://ir-datasets.com/nyt.html)
 - [PubMed Central (TREC CDS)](https://ir-datasets.com/pmc.html)
 - [TREC Arabic](https://ir-datasets.com/trec-arabic.html)
 - [TREC Fair Ranking 2021](https://ir-datasets.com/trec-fair-2021.html)
 - [TREC Mandarin](https://ir-datasets.com/trec-mandarin.html)
 - [TREC Robust 2004](https://ir-datasets.com/trec-robust04.html)
 - [TREC Spanish](https://ir-datasets.com/trec-spanish.html)
 - [TripClick](https://ir-datasets.com/tripclick.html)
 - [Tweets 2013 (Internet Archive)](https://ir-datasets.com/tweets2013-ia.html)
 - [Vaswani](https://ir-datasets.com/vaswani.html)
 - [Washington Post](https://ir-datasets.com/wapo.html)
 - [WikIR](https://ir-datasets.com/wikir.html)

There are "subsets" under each dataset. For instance, `clueweb12/b13/trec-misinfo-2019` provides the
queries and judgments from the [2019 TREC misinformation track](https://trec.nist.gov/data/misinfo2019.html),
and `msmarco-document/orcas` provides the [ORCAS dataset](https://microsoft.github.io/msmarco/ORCAS). They
tend to be organized with the document collection at the top level.
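
Subsets are loaded with the same API as top-level datasets; queries and relevance judgments are available through `queries_iter()` and `qrels_iter()`. A minimal sketch (field names follow the dataset's query and qrel types):

```python
import ir_datasets

dataset = ir_datasets.load('msmarco-document/orcas')
for query in dataset.queries_iter():
    ...  # namedtuples, e.g., with query_id and text fields
for qrel in dataset.qrels_iter():
    ...  # namedtuples, e.g., with query_id, doc_id, and relevance fields
```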

See the ir_datasets docs ([ir-datasets.com](https://ir-datasets.com/)) for details about each
dataset, its available subsets, and what data they provide.

## Environment variables

 - `IR_DATASETS_HOME`: Home directory for ir_datasets data (default `~/.ir_datasets/`). Contains directories
   for each top-level dataset.
 - `IR_DATASETS_TMP`: Temporary working directory (default `/tmp/ir_datasets/`).
 - `IR_DATASETS_DL_TIMEOUT`: Download stream read timeout, in seconds (default `15`). If no data is received
   within this duration, the connection will be assumed to be dead, and another download may be attempted.
 - `IR_DATASETS_DL_TRIES`: Number of download attempts before an exception is thrown (default `3`).
   When the server accepts Range requests, they are used to resume partial downloads; otherwise, the entire file is downloaded again.
 - `IR_DATASETS_DL_DISABLE_PBAR`: Set to `true` to disable the progress bar for downloads. Useful in settings
   where an interactive console is not available.
 - `IR_DATASETS_DL_SKIP_SSL`: Set to `true` to disable checking SSL certificates when downloading files.
   Useful as a short-term solution when SSL certificates expire or are otherwise invalid. Note that this
   does not disable hash verification of the downloaded content.
 - `IR_DATASETS_SKIP_DISK_FREE`: Set to `true` to disable checks for enough free space on disk before
   downloading content or otherwise creating large files.
 - `IR_DATASETS_SMALL_FILE_SIZE`: The size of files that are considered "small", in bytes. Instructions for
   linking small files rather than downloading them are not shown. Defaults to 5000000 (5MB).
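
These variables are read from the environment at runtime. For example, to keep data on a larger disk and disable the progress bar in a batch job (the paths below are illustrative):

```bash
export IR_DATASETS_HOME=/data/ir_datasets      # store datasets on a large volume
export IR_DATASETS_TMP=/data/tmp/ir_datasets
export IR_DATASETS_DL_DISABLE_PBAR=true        # no interactive console in batch jobs
ir_datasets export msmarco-passage/train docs | head -n2
```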

## Citing

When using datasets provided by this package, be sure to properly cite them. BibTeX for each dataset
can be found on the [datasets documentation page](https://ir-datasets.com/).

If you use this tool, please cite [our SIGIR resource paper](https://arxiv.org/pdf/2103.02280.pdf):

```
@inproceedings{macavaney:sigir2021-irds,
  author = {MacAvaney, Sean and Yates, Andrew and Feldman, Sergey and Downey, Doug and Cohan, Arman and Goharian, Nazli},
  title = {Simplified Data Wrangling with ir_datasets},
  year = {2021},
  booktitle = {SIGIR}
}
```

## Credits

Contributors to this repository:

 - Sean MacAvaney (University of Glasgow)
 - Shuo Sun (Johns Hopkins University)
 - Thomas Jänich (University of Glasgow)
 - Jan Heinrich Reimer (Martin Luther University Halle-Wittenberg)
 - Maik Fröbe (Martin Luther University Halle-Wittenberg)
 - Eugene Yang (Johns Hopkins University)
 - Augustin Godinot (NAVERLABS Europe, ENS Paris-Saclay)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/allenai/ir_datasets",
    "name": "ir-datasets",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Sean MacAvaney",
    "author_email": "sean.macavaney@glasgow.ac.uk",
    "download_url": "https://files.pythonhosted.org/packages/a0/12/0f99bbd93c62b183d94b7b68ef570dae0cbc64b14e381e26d52aaa2f4827/ir_datasets-0.5.9.tar.gz",
    "platform": null,
    "description": "# ir_datasets\n\n`ir_datasets` is a python package that provides a common interface to many IR ad-hoc ranking\nbenchmarks, training datasets, etc.\n\nThe package takes care of downloading datasets (including documents, queries, relevance judgments,\netc.) when available from public sources. Instructions on how to obtain datasets are provided when\nthey are not publicly available.\n\n`ir_datasets` provides a common iterator format to allow them to be easily used in python. It\nattempts to provide the data in an unaltered form (i.e., keeping all fields and markup), while\nhandling differences in file formats, encoding, etc. Adapters provide extra functionality, e.g., to\nallow quick lookups of documents by ID.\n\nA command line interface is also available.\n\nYou can find a list of datasets and their features [here](https://ir-datasets.com/).\nWant a new dataset, added functionality, or a bug fixed? Feel free to post an issue or make a pull request! \n\n## Getting Started\n\nFor a quick start with the Python API, check out our Colab tutorials:\n[Python](https://colab.research.google.com/github/allenai/ir_datasets/blob/master/examples/ir_datasets.ipynb)\n[Command Line](https://colab.research.google.com/github/allenai/ir_datasets/blob/master/examples/ir_datasets_cli.ipynb)\n\nInstall via pip:\n\n```\npip install ir_datasets\n```\n\nIf you want the main branch, you install as such:\n\n```\npip install git+https://github.com/allenai/ir_datasets.git\n```\n\nIf you want to build from source, use:\n\n```\n$ git clone https://github.com/allenai/ir_datasets\n$ cd ir_datasets\n$ python setup.py bdist_wheel\n$ pip install dist/ir_datasets-*.whl\n```\n\nTested with python versions 3.7, 3.8, 3.9, and 3.10. (Mininum python version is 3.7.)\n\n## Features\n\n**Python and Command Line Interfaces**. Access datasts both through a simple Python API and\nvia the command line.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('msmarco-passage/train')\n# Documents\nfor doc in dataset.docs_iter():\n    print(doc)\n# GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equa...\n# GenericDoc(doc_id='1', text='The Manhattan Project and its atomic bomb helped bring an en...\n# ...\n```\n\n```bash\nir_datasets export msmarco-passage/train docs | head -n2\n0 The presence of communication amid scientific minds was equally important to the success of the Manh...\n1 The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peacefu...\n```\n\n**Automatically downloads source files** (when available). Will download and verify the source\nfiles for queries, documents, qrels, etc. 
when they are publicly available, as they are needed.\nA CI build checks weekly to ensure that all the downloadable content is available and correct:\n[![Downloadable Content](https://github.com/seanmacavaney/ir-datasets.com/actions/workflows/verify_downloads.yml/badge.svg)](https://github.com/seanmacavaney/ir-datasets.com/actions/workflows/verify_downloads.yml).\nWe mirror some troublesome files on [mirror.ir-datasets.com](https://mirror.ir-datasets.com/), and\nautomatically switch to the mirror when the original source is not available.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('msmarco-passage/train')\nfor doc in dataset.docs_iter(): # Will download and extract MS-MARCO's collection.tar.gz the first time\n    ...\nfor query in dataset.queries_iter(): # Will download and extract MS-MARCO's queries.tar.gz the first time\n    ...\n```\n\n**Instructions for dataset access** (when not publicly available). Provides instructions on how\nto get a copy of the data when it is not publicly available online (e.g., when it requires a\ndata usage agreement).\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('trec-arabic')\nfor doc in dataset.docs_iter():\n    ...\n# Provides the following instructions:\n# The dataset is based on the Arabic Newswire corpus. It is available from the LDC via: <https://catalog.ldc.upenn.edu/LDC2001T55>\n# To proceed, symlink the source file here: [gives path]\n```\n\n**Support for datasets big and small**. By using iterators, supports large datasets that may\nnot fit into system memory, such as ClueWeb.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('clueweb09')\nfor doc in dataset.docs_iter():\n    ... # will iterate through all ~1B documents\n```\n\n**Fixes known dataset issues**. For instance, automatically corrects the document UTF-8 encoding\nproblem in the MS-MARCO passage collection.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('msmarco-passage')\ndocstore = dataset.docs_store()\ndocstore.get('243').text\n# \"John Maynard Keynes, 1st Baron Keynes, CB, FBA (/\u02c8ke\u026anz/ KAYNZ; 5 June 1883 \u2013 21 April [SNIP]\"\n# Na\u00efve UTF-8 decoding yields double-encoding artifacts like:\n# \"John Maynard Keynes, 1st Baron Keynes, CB, FBA (/\u00cb\\x88ke\u00c9\u00aanz/ KAYNZ; 5 June 1883 \u00e2\\x80\\x93 21 April [SNIP]\"\n#                                                  ~~~~~~  ~~                       ~~~~~~~~~\n```\n\n**Fast Random Document Access.** Builds data structures that allow fast and efficient lookup of\ndocument content. For large datasets, such as ClueWeb, uses\n[checkpoint files](https://ir-datasets.com/clueweb_warc_checkpoints.md) to load documents from\nsource 40x faster than normal. Results are cached for even faster subsequent accesses.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('clueweb12')\ndocstore = dataset.docs_store()\ndocstore.get_many(['clueweb12-0000tw-05-00014', 'clueweb12-0000tw-05-12119', 'clueweb12-0106wb-18-19516'])\n# {'clueweb12-0000tw-05-00014': ..., 'clueweb12-0000tw-05-12119': ..., 'clueweb12-0106wb-18-19516': ...}\n```\n\n**Fancy Iter Slicing.** Sometimes it's helpful to be able to select ranges of data (e.g., for processing\ndocument collections in parallel on multiple devices). 
Efficient implementations of slicing operations\nallow for much faster dataset partitioning than using `itertools.slice`.\n\n```python\nimport ir_datasets\ndataset = ir_datasets.load('clueweb12')\ndataset.docs_iter()[500:1000] # normal slicing behavior\n# WarcDoc(doc_id='clueweb12-0000tw-00-00502', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00503', ...), ...\ndataset.docs_iter()[-10:-8] # includes negative indexing\n# WarcDoc(doc_id='clueweb12-1914wb-28-24245', ...), WarcDoc(doc_id='clueweb12-1914wb-28-24246', ...)\ndataset.docs_iter()[::100] # includes support for skip (only positive values)\n# WarcDoc(doc_id='clueweb12-0000tw-00-00000', ...), WarcDoc(doc_id='clueweb12-0000tw-00-00100', ...), ...\ndataset.docs_iter()[1/3:2/3] # supports proportional slicing (this takes the middle third of the collection)\n# WarcDoc(doc_id='clueweb12-0605wb-28-12714', ...), WarcDoc(doc_id='clueweb12-0605wb-28-12715', ...), ...\n```\n\n## Datasets\n\nAvailable datasets include:\n - [ANTIQUE](https://ir-datasets.com/antique.html)\n - [AQUAINT](https://ir-datasets.com/aquaint.html)\n - [BEIR (benchmark suite)](https://ir-datasets.com/beir.html)\n - [TREC CAR](https://ir-datasets.com/car.html)\n - [C4](https://ir-datasets.com/c4.html)\n - [ClueWeb09](https://ir-datasets.com/clueweb09.html)\n - [ClueWeb12](https://ir-datasets.com/clueweb12.html)\n - [CLIRMatrix](https://ir-datasets.com/clirmatrix.html)\n - [CodeSearchNet](https://ir-datasets.com/codesearchnet.html)\n - [CORD-19](https://ir-datasets.com/cord19.html)\n - [DPR Wiki100](https://ir-datasets.com/dpr-w100.html)\n - [GOV](https://ir-datasets.com/gov.html)\n - [GOV2](https://ir-datasets.com/gov2.html)\n - [HC4](https://ir-datasets.com/hc4.html)\n - [Highwire (TREC Genomics 2006-07)](https://ir-datasets.com/highwire.html)\n - [Medline](https://ir-datasets.com/medline.html)\n - [MSMARCO (document)](https://ir-datasets.com/msmarco-document.html)\n - [MSMARCO (passage)](https://ir-datasets.com/msmarco-passage.html)\n - [MSMARCO (QnA)](https://ir-datasets.com/msmarco-qna.html)\n - [Natural Questions](https://ir-datasets.com/natural-questions.html)\n - [NFCorpus (NutritionFacts)](https://ir-datasets.com/nfcorpus.html)\n - [NYT](https://ir-datasets.com/nyt.html)\n - [PubMed Central (TREC CDS)](https://ir-datasets.com/pmc.html)\n - [TREC Arabic](https://ir-datasets.com/trec-arabic.html)\n - [TREC Fair Ranking 2021](https://ir-datasets.com/trec-fair-2021.html)\n - [TREC Mandarin](https://ir-datasets.com/trec-mandarin.html)\n - [TREC Robust 2004](https://ir-datasets.com/trec-robust04.html)\n - [TREC Spanish](https://ir-datasets.com/trec-spanish.html)\n - [TripClick](https://ir-datasets.com/tripclick.html)\n - [Tweets 2013 (Internet Archive)](https://ir-datasets.com/tweets2013-ia.html)\n - [Vaswani](https://ir-datasets.com/vaswani.html)\n - [Washington Post](https://ir-datasets.com/wapo.html)\n - [WikIR](https://ir-datasets.com/wikir.html)\n\nThere are \"subsets\" under each dataset. For instance, `clueweb12/b13/trec-misinfo-2019` provides the\nqueries and judgments from the [2019 TREC misinformation track](https://trec.nist.gov/data/misinfo2019.html),\nand `msmarco-document/orcas` provides the [ORCAS dataset](https://microsoft.github.io/msmarco/ORCAS). 
They\ntend to be organized with the document collection at the top level.\n\nSee the ir_dataets docs ([ir_datasets.com](https://ir-datasets.com/)) for details about each\ndataset, its available subsets, and what data they provide.\n\n## Environment variables\n\n - `IR_DATASETS_HOME`: Home directory for ir_datasets data (default `~/.ir_datasets/`). Contains directories\n   for each top-level dataset.\n - `IR_DATASETS_TMP`: Temporary working directory (default `/tmp/ir_datasets/`).\n - `IR_DATASETS_DL_TIMEOUT`: Download stream read timeout, in seconds (default `15`). If no data is received\n   within this duration, the connection will be assumed to be dead, and another download may be attempted.\n - `IR_DATASETS_DL_TRIES`: Default number of download attempts before exception is thrown (default `3`).\n   When the server accepts Range requests, uses them. Otherwise, will download the entire file again\n - `IR_DATASETS_DL_DISABLE_PBAR`: Set to `true` to disable the progress bar for downloads. Useful in settings\n   where an interactive console is not available.\n - `IR_DATASETS_DL_SKIP_SSL`: Set to `true` to disable checking SSL certificates when downloading files.\n   Useful as a short-term solution when SSL certificates expire or are otherwise invalid. Note that this\n   does not disable hash verification of the downloaded content.\n - `IR_DATASETS_SKIP_DISK_FREE`: Set to `true` to disable checks for enough free space on disk before\n   downloading content or otherwise creating large files.\n - `IR_DATASETS_SMALL_FILE_SIZE`: The size of files that are considered \"small\", in bytes. Instructions for\n   linking small files rather then downloading them are not shown. Defaults to 5000000 (5MB).\n\n## Citing\n\nWhen using datasets provided by this package, be sure to properly cite them. Bibtex for each dataset\ncan be found on the [datasets documentation page](https://ir-datasets.com/).\n\nIf you use this tool, please cite [our SIGIR resource paper](https://arxiv.org/pdf/2103.02280.pdf):\n\n```\n@inproceedings{macavaney:sigir2021-irds,\n  author = {MacAvaney, Sean and Yates, Andrew and Feldman, Sergey and Downey, Doug and Cohan, Arman and Goharian, Nazli},\n  title = {Simplified Data Wrangling with ir_datasets},\n  year = {2021},\n  booktitle = {SIGIR}\n}\n```\n\n## Credits\n\nContributors to this repository:\n\n - Sean MacAvaney (University of Glasgow)\n - Shuo Sun (Johns Hopkins University)\n - Thomas J\u00e4nich (University of Glasgow)\n - Jan Heinrich Reimer (Martin Luther University Halle-Wittenberg)\n - Maik Fr\u00f6be (Martin Luther University Halle-Wittenberg)\n - Eugene Yang (Johns Hopkins University)\n - Augustin Godinot (NAVERLABS Europe, ENS Paris-Saclay)\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.",
    "version": "0.5.9",
    "project_urls": {
        "Homepage": "https://github.com/allenai/ir_datasets"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1f7d14194ad38c5ad4a96f79a7aa1da97c2e8796c22d15ba1bfafcfe8948d49f",
                "md5": "089a36347214bb1824cb9ed6647cd6c6",
                "sha256": "07c9bed07f31031f1da1bc02afc7a1077b1179a3af402d061f83bf6fb833b90a"
            },
            "downloads": -1,
            "filename": "ir_datasets-0.5.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "089a36347214bb1824cb9ed6647cd6c6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 347928,
            "upload_time": "2024-11-08T11:01:07",
            "upload_time_iso_8601": "2024-11-08T11:01:07.407159Z",
            "url": "https://files.pythonhosted.org/packages/1f/7d/14194ad38c5ad4a96f79a7aa1da97c2e8796c22d15ba1bfafcfe8948d49f/ir_datasets-0.5.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a0120f99bbd93c62b183d94b7b68ef570dae0cbc64b14e381e26d52aaa2f4827",
                "md5": "710e3379d27f75d04c6bdad001ffd313",
                "sha256": "35c90980fbd0f4ea8fe22a1ab16d2bb6be3dc373cbd6dfab1d905f176a70e5ac"
            },
            "downloads": -1,
            "filename": "ir_datasets-0.5.9.tar.gz",
            "has_sig": false,
            "md5_digest": "710e3379d27f75d04c6bdad001ffd313",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 267937,
            "upload_time": "2024-11-08T11:01:05",
            "upload_time_iso_8601": "2024-11-08T11:01:05.028853Z",
            "url": "https://files.pythonhosted.org/packages/a0/12/0f99bbd93c62b183d94b7b68ef570dae0cbc64b14e381e26d52aaa2f4827/ir_datasets-0.5.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-08 11:01:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "allenai",
    "github_project": "ir_datasets",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "ir-datasets"
}
        