air-benchmark


Name: air-benchmark
Version: 0.1.0
home_page: None
Summary: AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
upload_time: 2024-10-17 09:46:37
maintainer: None
docs_url: None
author: None
requires_python: >=3.8
license: MIT License Copyright (c) 2024 AIR-Bench Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: embedding, benchmark, air-bench, reranker, information retrieval
            <h1 align="center">
<img style="vertical-align:middle" width="640" height="320" src="https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/images/banner.png" />
</h1>

<h4 align="center">
    <p>
        <a href="#%EF%B8%8F-motivation">Motivation</a> |
        <a href="#%EF%B8%8F-features">Features</a> |
        <a href="#%EF%B8%8F-documentation">Documentation</a> |
        <a href="https://huggingface.co/spaces/AIR-Bench/leaderboard">Leaderboard</a> |
        <a href="#%EF%B8%8F-citing">Citing</a>
    </p>
</h4>

<h3 align="center">
    <a href="https://huggingface.co/AIR-Bench"><img style="float: middle; padding: 10px 10px 10px 10px;" width="60" height="55" src="https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/images/hf_logo.png" /></a>
</h3>

## ☁️ Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as [MSMARCO](https://microsoft.github.io/msmarco/), [Natural Questions](https://ai.google.com/research/NaturalQuestions) (open-domain QA), [MIRACL](https://github.com/project-miracl/miracl) (multilingual retrieval), and [BEIR](https://github.com/beir-cellar/beir/) and [MTEB](https://github.com/embeddings-benchmark/mteb) (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

- **Inability to handle new domains**. All of the existing benchmarks are static: they are built for pre-defined domains from human-labeled data. As a result, they cannot cover new domains that users are interested in.
- **Potential risk of over-fitting and data leakage**. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were initially designed for zero-shot, out-of-domain (O.O.D.) evaluation, in-domain training data is widely used during fine-tuning. Worse, because the existing evaluation datasets are publicly available, the testing data could be mixed into a retriever's training set by mistake.

## ☁️ Features

- 🤖 **Automated**. The testing data is automatically generated by large language models without human intervention. It can therefore support the evaluation of new domains quickly and at a very small cost. Moreover, the newly generated testing data is very unlikely to be covered by the training sets of any existing retrievers.
- 🔍 **Retrieval and RAG-oriented**. The benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering and paraphrase retrieval, it incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the chunks of a very long document that contain the critical information needed to answer the input question.
- 🔄 **Heterogeneous and Dynamic**. The testing data is generated for diverse and continually expanding domains and languages (i.e., multi-domain and multilingual). As a result, it provides an increasingly comprehensive evaluation benchmark for community developers.
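
To make the inner-document retrieval setting more concrete, here is a minimal, self-contained sketch. It is not the AIR-Bench pipeline and not a real retriever: the function names (`chunk_document`, `retrieve_chunks`) are hypothetical, and a toy word-overlap score stands in for an embedding model. It only illustrates the shape of the task: split one long document into chunks, then rank the chunks against a question.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_document(text: str, chunk_size: int = 8) -> List[str]:
    """Split a long document into consecutive chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve_chunks(question: str, chunks: List[str], top_k: int = 1) -> List[Tuple[int, float]]:
    """Rank chunks by token overlap with the question (toy stand-in for a retriever)."""
    question_counts = Counter(tokenize(question))
    scored = []
    for idx, chunk in enumerate(chunks):
        chunk_counts = Counter(tokenize(chunk))
        overlap = sum(min(question_counts[t], chunk_counts[t]) for t in question_counts)
        scored.append((idx, float(overlap)))
    # Highest score first; ties are broken by chunk position.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical chunks of one long document.
    chunks = [
        "The museum opens at nine every morning.",
        "Tickets cost twelve dollars for adults.",
        "The gift shop closes at five.",
    ]
    print(retrieve_chunks("How much do tickets cost?", chunks))  # [(1, 2.0)]
```

In a real evaluation, the scoring function would be replaced by the retrieval (and optionally reranking) model under test.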

## ☁️ Versions

We plan to release new test datasets on a regular basis. The latest version is `AIR-Bench_24.05`.

|      Version      | Release Date | # of domains | # of languages | # of datasets |                           Details                            |
| :---------------: | :----------: | :------: | :--------: | :-------: | :----------------------------------------------------------: |
| `AIR-Bench_24.05` | Oct 17, 2024 |    9 <sup>[1]</sup>   |     13 <sup>[2]</sup>    |    69     | [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md#air-bench_2405) |
| `AIR-Bench_24.04` | May 21, 2024 |    8 <sup>[3]</sup>    |     2 <sup>[4]</sup>     |    28     | [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md#air-bench_2404) |

> [1] wiki, web, news, healthcare, law, finance, arxiv, book, science.
>
> [2] en, zh, es, fr, de, ru, ja, ko, ar, fa, id, hi, bn (English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali).
>
> [3] wiki, web, news, healthcare, law, finance, arxiv, book.
>
> [4] en, zh (English, Chinese).

For the differences between versions, see the [available tasks documentation](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md).
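
As a quick illustration of what changed between the two versions, the sketch below transcribes the domains, languages, and dataset counts from the table and footnotes above into a plain dictionary and computes what `AIR-Bench_24.05` adds. The `VERSIONS` structure and the `added` helper are illustrative only and are not part of the `air-benchmark` package.

```python
# Values transcribed from the version table and footnotes [1]-[4] above.
VERSIONS = {
    "AIR-Bench_24.05": {
        "domains": {"wiki", "web", "news", "healthcare", "law",
                    "finance", "arxiv", "book", "science"},
        "languages": {"en", "zh", "es", "fr", "de", "ru", "ja",
                      "ko", "ar", "fa", "id", "hi", "bn"},
        "datasets": 69,
    },
    "AIR-Bench_24.04": {
        "domains": {"wiki", "web", "news", "healthcare", "law",
                    "finance", "arxiv", "book"},
        "languages": {"en", "zh"},
        "datasets": 28,
    },
}


def added(field, new="AIR-Bench_24.05", old="AIR-Bench_24.04"):
    """Return the sorted entries of `field` present in `new` but not in `old`."""
    return sorted(VERSIONS[new][field] - VERSIONS[old][field])


if __name__ == "__main__":
    print(added("domains"))         # ['science']
    print(len(added("languages")))  # 11
```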

## ☁️ Results

You can check out the results on the [AIR-Bench Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard). Detailed results are available in [eval_results](https://huggingface.co/datasets/AIR-Bench/eval_results/tree/main).

Some brief analysis results are available [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_analysis_results.md). *The technical report is coming soon*. Please stay tuned for updates!

## ☁️ Usage
### Installation
This repository maintains the codebase for running AIR-Bench evaluations. To run an evaluation, first install `air-benchmark`.

```bash
pip install air-benchmark
```
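
To confirm that the installation succeeded, one option is to query the installed distribution's version with the standard library. This is a minimal sketch that relies only on `importlib.metadata` (available from Python 3.8, the minimum version this package requires); the helper name `installed_version` is hypothetical, and nothing here uses the package's own API.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "air-benchmark") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("air-benchmark is not installed; run `pip install air-benchmark`.")
    else:
        print(f"air-benchmark {version} is installed.")
```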

### Evaluations

Follow the steps below to run evaluations and submit the results to the leaderboard (see [this guide](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md) for more detailed information).

1. Run evaluations
    - See the [scripts](https://github.com/AIR-Bench/AIR-Bench/blob/main/scripts) to run evaluations on AIR-Bench for your models.

2. Submit search results (*Only for test set*)
    - Package the output files
      - For results without a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --save_dir search_results
      ```

      - For results with a reranking model:

      ```bash
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --reranker_name [YOUR_RERANKING_MODEL] \
      --save_dir search_results
      ```

    - Upload the output `.zip` and fill in the model information at [AIR-Bench Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard)
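
If you package results for several models, the two command variants above can be assembled programmatically. The sketch below uses only the flags shown in this section (`--results_dir`, `--retriever_name`, `--reranker_name`, `--save_dir`); the helper `build_zip_command` and the model names are hypothetical, and the script is assumed to be run from inside the `scripts` directory as in the commands above.

```python
import subprocess
import sys
from typing import List, Optional


def build_zip_command(
    retriever_name: str,
    reranker_name: Optional[str] = None,
    results_dir: str = "search_results",
    save_dir: str = "search_results",
) -> List[str]:
    """Build the argument list for zip_results.py from the flags shown above."""
    cmd = [
        sys.executable, "zip_results.py",
        "--results_dir", results_dir,
        "--retriever_name", retriever_name,
    ]
    # The reranker flag is only passed when a reranking model was used.
    if reranker_name is not None:
        cmd += ["--reranker_name", reranker_name]
    cmd += ["--save_dir", save_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical model names; replace them with your own.
    cmd = build_zip_command("my-retriever", reranker_name="my-reranker")
    print(" ".join(cmd))
    # Uncomment to run the script from the scripts/ directory:
    # subprocess.run(cmd, cwd="scripts", check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when model names contain spaces or slashes.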

## ☁️ Documentation

| Documentation                                                |                                                           |
| ------------------------------------------------------------ | --------------------------------------------------------- |
| 🏭 [Pipeline](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/data_generation.md) | The data generation pipeline of AIR-Bench                 |
| 📋 [Tasks](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md) | Overview of available tasks in AIR-Bench                  |
| 📈 [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) | The interactive leaderboard of AIR-Bench                  |
| 🚀 [Submit](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md) | Information related to how to submit a model to AIR-Bench |
| 🤝 [Contributing](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/community_contribution.md) | How to contribute to AIR-Bench                            |


## ☁️ Acknowledgement
This work is inspired by [MTEB](https://github.com/embeddings-benchmark/mteb) and [BEIR](https://github.com/beir-cellar/beir/). Many thanks for the early feedback from [@tomaarsen](https://github.com/tomaarsen), [@Muennighoff](https://github.com/Muennighoff), [@takatost](https://github.com/takatost), and [@chtlp](https://github.com/chtlp).


## ☁️ Citing

*The technical report is coming soon*. Please stay tuned for updates!

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "air-benchmark",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "embedding, benchmark, air-bench, reranker, information retrieval",
    "author": null,
    "author_email": "BAAI <zhengliu1026@gmail.com>, Jina AI <research@jina.ai>",
    "download_url": "https://files.pythonhosted.org/packages/df/15/dce34be9b2f304880bf132a4da27eff51a71778cd75ff94352cf16cfbeb9/air_benchmark-0.1.0.tar.gz",
    "platform": null,
    "description": "<h1 align=\"center\">\n<img style=\"vertical-align:middle\" width=\"640\" height=\"320\" src=\"https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/images/banner.png\" />\n</h1>\n\n<h4 align=\"center\">\n    <p>\n        <a href=\"#%EF%B8%8F-motivation\">Motivation</a> |\n        <a href=\"#%EF%B8%8F-features\">Features</a> |\n        <a href=\"#%EF%B8%8F-documentation\">Documentation</a> |\n        <a href=\"https://huggingface.co/spaces/AIR-Bench/leaderboard\">Leaderboard</a> |\n        <a href=\"#%EF%B8%8F-citing\">Citing</a>\n    <p>\n</h4>\n\n<h3 align=\"center\">\n    <a href=\"https://huggingface.co/AIR-Bench\"><img style=\"float: middle; padding: 10px 10px 10px 10px;\" width=\"60\" height=\"55\" src=\"https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/images/hf_logo.png\" /></a>\n</h3>\n\n## \u2601\ufe0f Motivation\n\nEvaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as [MSMARCO](https://microsoft.github.io/msmarco/), [Natural Question](https://ai.google.com/research/NaturalQuestions) (open-domain QA), [MIRACL](https://github.com/project-miracl/miracl) (multilingual retrieval), [BEIR](https://github.com/beir-cellar/beir/) and [MTEB](https://github.com/embeddings-benchmark/mteb) (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following perspectives.\n\n- **Incapability of dealing with new domains**. All of the existing benchmarks are static, which means they are established for the pre-defined domains based on human labeled data. Therefore, they are incapable of dealing with new domains which are interested by the users. \n- **Potential risk of over-fitting and data leakage**. The existing retrievers are intensively fine-tuned in order to achieve strong performances on popular benchmarks, like BEIR and MTEB. 
Despite that these benchmarks are initially designed for zero-shot evaluation of O.O.D. Evaluation, the in-domain training data is widely used during the fine-tuning process. What is worse, given the public availability of the existing evaluation datasets, the testing data could be falsely mixed into the retrievers' training set by mistake. \n\n## \u2601\ufe0f Features\n\n- \ud83e\udd16 **Automated**. The testing data is automatically generated by large language models without human intervention. Therefore, it is able to instantly support the evaluation of new domains at a very small cost. Besides, the new testing data is almost impossible to be covered by the training sets of any existing retrievers.\n- \ud83d\udd0d **Retrieval and RAG-oriented**. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to the typical evaluation scenarios, like open-domain question answering or paraphrase retrieval, the new benchmark also incorporates a new setting called inner-document retrieval which is closely related with today's LLM and RAG applications. In this new setting, the model is expected to retrieve the relevant chunks of a very long documents, which contain the critical information to answer the input question. \n- \ud83d\udd04 **Heterogeneous and Dynamic**. The testing data is generated w.r.t. diverse and constantly augmented domains and languages (i.e. Multi-domain, Multi-lingual). As a result, it is able to provide an increasingly comprehensive evaluation benchmark for the community developers.\n\n## \u2601\ufe0f Versions\n\nWe plan to release new test datasets on regular basis. 
The latest version is `AIR-Bench_24.05`.\n\n|      Version      | Release Date | # of domains | # of languages | # of datasets |                           Details                            |\n| :---------------: | :----------: | :------: | :--------: | :-------: | :----------------------------------------------------------: |\n| `AIR-Bench_24.05` | Oct 17, 2024 |    9 <sup>[1]</sup>   |     13 <sup>[2]</sup>    |    69     | [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md#air-bench_2405) |\n| `AIR-Bench_24.04` | May 21, 2024 |    8 <sup>[3]</sup>    |     2 <sup>[4]</sup>     |    28     | [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md#air-bench_2404) |\n\n> [1] wiki, web, news, healthcare, law, finance, arxiv, book, science.\n>\n> [2] en, zh, es, fr, de, ru, ja, ko, ar, fa, id, hi, bn (English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali).\n>\n> [3] wiki, web, news, healthcare, law, finance, arxiv, book.\n>\n> [4] en, zh (English, Chinese).\n\nFor the differences between different versions, please refer to [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md).\n\n## \u2601\ufe0f Results\n\nYou could check out the results at\n[AIR-Bench Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard). Detailed results are available in [eval_results](https://huggingface.co/datasets/AIR-Bench/eval_results/tree/main).\n\nSome brief analysis results are available [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_analysis_results.md). *The technical report is coming soon*. Please stay tuned for updates!\n\n## \u2601\ufe0f Usage\n### Installation\nThis repo is used to maintain the codebases for running AIR-Bench evaluation. 
To run the evaluation, please install `air-benchmark`.\n\n```bash\npip install air-benchmark\n```\n\n### Evaluations\n\nRefer to the steps below to run evaluations and submit the results to the leaderboard (refer to [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md) for more detailed information).\n\n1. Run evaluations\n    - See the [scripts](https://github.com/AIR-Bench/AIR-Bench/blob/main/scripts) to run evaluations on AIR-Bench for your models.\n\n2. Submit search results (*Only for test set*)\n    - Package the output files\n      - As for the results without reranking models,\n\n      ```bash\n      cd scripts\n      python zip_results.py \\\n      --results_dir search_results \\\n      --retriever_name [YOUR_RETRIEVAL_MODEL] \\\n      --save_dir search_results\n      ```\n\n      - As for the results with reranking models\n\n      ```bash\n      cd scripts\n      python zip_results.py \\\n      --results_dir search_results \\\n      --retriever_name [YOUR_RETRIEVAL_MODEL] \\\n      --reranker_name [YOUR_RERANKING_MODEL] \\\n      --save_dir search_results\n      ```\n\n    - Upload the output `.zip` and fill in the model information at [AIR-Bench Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard)\n\n## \u2601\ufe0f Documentation\n\n| Documentation                                                |                                                           |\n| ------------------------------------------------------------ | --------------------------------------------------------- |\n| \ud83c\udfed [Pipeline](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/data_generation.md) | The data generation pipeline of AIR-Bench                 |\n| \ud83d\udccb [Tasks](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_tasks.md) | Overview of available tasks in AIR-Bench                  |\n| \ud83d\udcc8 [Leaderboard](https://huggingface.co/spaces/AIR-Bench/leaderboard) | The interactive leaderboard of 
AIR-Bench                  |\n| \ud83d\ude80 [Submit](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md) | Information related to how to submit a model to AIR-Bench |\n| \ud83e\udd1d [Contributing](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/community_contribution.md) | How to contribute to AIR-Bench                            |\n\n\n## \u2601\ufe0f Acknowledgement\nThis work is inspired by [MTEB](https://github.com/embeddings-benchmark/mteb) and [BEIR](https://github.com/beir-cellar/beir/). Many thanks for the early feedbacks from [@tomaarsen](https://github.com/tomaarsen), [@Muennighoff](https://github.com/Muennighoff), [@takatost](https://github.com/takatost), [@chtlp](https://github.com/chtlp).\n\n\n## \u2601\ufe0f Citing\n\n*The technical report is coming soon*. Please stay tuned for updates!\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 AIR-Bench  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark",
    "version": "0.1.0",
    "project_urls": {
        "Huggingface Organization": "https://huggingface.co/AIR-Bench",
        "Leaderboard": "https://huggingface.co/spaces/AIR-Bench/leaderboard",
        "homepage": "https://github.com/AIR-Bench/AIR-Bench/tree/main"
    },
    "split_keywords": [
        "embedding",
        " benchmark",
        " air-bench",
        " reranker",
        " information retrieval"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e48b6e4732d2367a63c2cf36e7e1b2703c7f11d63232f2e80ad2a6a51e0160ec",
                "md5": "be86ebcb9c504abb1e5a0f15df424ae1",
                "sha256": "e619d3cfe1d9a5a434e9fad8e9dba3f0a569961e66a6a4091e49ab13e4d4f37f"
            },
            "downloads": -1,
            "filename": "air_benchmark-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "be86ebcb9c504abb1e5a0f15df424ae1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 48050,
            "upload_time": "2024-10-17T09:46:35",
            "upload_time_iso_8601": "2024-10-17T09:46:35.543705Z",
            "url": "https://files.pythonhosted.org/packages/e4/8b/6e4732d2367a63c2cf36e7e1b2703c7f11d63232f2e80ad2a6a51e0160ec/air_benchmark-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "df15dce34be9b2f304880bf132a4da27eff51a71778cd75ff94352cf16cfbeb9",
                "md5": "a64719d1e6b9db509ed437e7a150da1e",
                "sha256": "6cd40c86d03ed7ba805a934582911fa39afed6269290ef0bab71efcd43ede137"
            },
            "downloads": -1,
            "filename": "air_benchmark-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a64719d1e6b9db509ed437e7a150da1e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 38692,
            "upload_time": "2024-10-17T09:46:37",
            "upload_time_iso_8601": "2024-10-17T09:46:37.355734Z",
            "url": "https://files.pythonhosted.org/packages/df/15/dce34be9b2f304880bf132a4da27eff51a71778cd75ff94352cf16cfbeb9/air_benchmark-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-17 09:46:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "AIR-Bench",
    "github_project": "AIR-Bench",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "air-benchmark"
}
        