# evalscope

- **Name**: evalscope
- **Version**: 0.6.0
- **Summary**: EvalScope: Lightweight LLMs Evaluation Framework
- **Home page**: https://github.com/modelscope/evalscope
- **Author**: ModelScope team (contact@modelscope.cn)
- **Requires Python**: >=3.8
- **License**: not specified
- **Keywords**: python, llm, evaluation
- **Requirements**: none recorded
- **Upload time**: 2024-11-08 05:53:47

![](docs/en/_static/images/evalscope_logo.png)

<p align="center">
    English | <a href="README_zh.md">็ฎ€ไฝ“ไธญๆ–‡</a>
</p>

<p align="center">
<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
</a>
<a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'>
    <img src='https://readthedocs.org/projects/evalscope-en/badge/?version=latest' alt='Documentation Status' />
</a>
<br>
 <a href="https://evalscope.readthedocs.io/en/latest/">๐Ÿ“– Documents</a>
</p>


## ๐Ÿ“‹ Table of Contents
- [Introduction](#introduction)
- [News](#News)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Evaluation Backend](#evaluation-backend)
- [Custom Dataset Evaluation](#custom-dataset-evaluation)
- [Offline Evaluation](#offline-evaluation)
- [Arena Mode](#arena-mode)
- [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation)


## ๐Ÿ“ Introduction

EvalScope is the official model evaluation and performance benchmarking framework launched by the [ModelScope](https://modelscope.cn/) community. It comes with built-in common benchmarks and evaluation metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval. EvalScope supports various types of model evaluations, including LLMs, multimodal LLMs, embedding models, and reranker models. It is also applicable to multiple evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. Moreover, through seamless integration with the ms-swift training framework, evaluations can be initiated with a single click, providing full end-to-end support from model training to evaluation 🚀

<p align="center">
  <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
  <br>EvalScope Framework.
</p>

The architecture includes the following modules:
1. **Model Adapter**: Converts the outputs of specific models into the format required by the framework, supporting both API-based and locally run models.
2. **Data Adapter**: Converts and preprocesses input data to meet various evaluation needs and formats.
3. **Evaluation Backend**: 
    - **Native**: EvalScopeโ€™s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
    - **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
    - **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
    - **RAGEval**: Supports RAG evaluation, supporting independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
    - **ThirdParty**: Other third-party evaluation tasks, such as ToolBench.
4. **Performance Evaluator**: Measures the performance of model inference services, including performance testing, stress testing, performance report generation, and visualization.
5. **Evaluation Report**: The final generated evaluation report summarizes the model's performance, which can be used for decision-making and further model optimization.
6. **Visualization**: Visualization results help users intuitively understand evaluation results, facilitating analysis and comparison of different model performances.


## ๐ŸŽ‰ News
- ๐Ÿ”ฅ **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [๐Ÿ“– Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
- ๐Ÿ”ฅ **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
- ๐Ÿ”ฅ **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
- ๐Ÿ”ฅ **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [๐Ÿ“– read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
- ๐Ÿ”ฅ **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.
- ๐Ÿ”ฅ **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
- ๐Ÿ”ฅ **[2024.08.20]** Updated the official documentation, including getting started guides, best practices, and FAQs. Feel free to [๐Ÿ“–read it here](https://evalscope.readthedocs.io/en/latest/)!
- ๐Ÿ”ฅ **[2024.08.09]** Simplified the installation process, allowing for pypi installation of vlmeval dependencies; optimized the multimodal model evaluation experience, achieving up to 10x acceleration based on the OpenAI API evaluation chain.
- ๐Ÿ”ฅ **[2024.07.31]** Important change: The package name `llmuses` has been changed to `evalscope`. Please update your code accordingly.
- ๐Ÿ”ฅ **[2024.07.26]** Support for **VLMEvalKit** as a third-party evaluation framework to initiate multimodal model evaluation tasks.
- ๐Ÿ”ฅ **[2024.06.29]** Support for **OpenCompass** as a third-party evaluation framework, which we have encapsulated at a higher level, supporting pip installation and simplifying evaluation task configuration.
- ๐Ÿ”ฅ **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
- ๐Ÿ”ฅ **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.



## ๐Ÿ› ๏ธ Installation
### Method 1: Install Using pip
We recommend using conda to manage your environment and installing dependencies with pip:

1. Create a conda environment (optional)
   ```shell
   # It is recommended to use Python 3.10
   conda create -n evalscope python=3.10
   # Activate the conda environment
   conda activate evalscope
   ```

2. Install dependencies using pip
   ```shell
   pip install evalscope                # Install Native backend (default)
   # Additional options
   pip install evalscope[opencompass]   # Install OpenCompass backend
   pip install evalscope[vlmeval]       # Install VLMEvalKit backend
   pip install evalscope[all]           # Install all backends (Native, OpenCompass, VLMEvalKit)
   ```

> [!WARNING]
> As the project has been renamed to `evalscope`, for versions `v0.4.3` or earlier, you can install using the following command:
> ```shell
> pip install 'llmuses<=0.4.3'  # quote the spec so the shell does not treat <= as redirection
> ```
> To import relevant dependencies using `llmuses`:
> ``` python
> from llmuses import ...
> ```
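
If you still have scripts that import `llmuses`, only the package root changes after the rename. A minimal migration sketch, using the `run_task` entry point shown later in this README (the old import path is shown for illustration only):

```python
# Old import (llmuses <= v0.4.3), shown as a comment for illustration:
#   from llmuses.run import run_task
# New import after the rename to evalscope:
from evalscope.run import run_task
```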

### Method 2: Install from Source
1. Download the source code
   ```shell
   git clone https://github.com/modelscope/evalscope.git
   ```

2. Install dependencies
   ```shell
   cd evalscope/
   pip install -e .                  # Install Native backend
   # Additional options
   pip install -e '.[opencompass]'   # Install OpenCompass backend
   pip install -e '.[vlmeval]'       # Install VLMEvalKit backend
   pip install -e '.[all]'           # Install all backends (Native, OpenCompass, VLMEvalKit)
   ```


## ๐Ÿš€ Quick Start

### 1. Simple Evaluation
To evaluate a model using default settings on specified datasets, follow the process below:

#### Install using pip
You can execute this command from any directory:
```bash
python -m evalscope.run \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --datasets arc 
```

#### Install from source
Execute this command in the `evalscope` directory:
```bash
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --datasets arc
```

If prompted with `Do you wish to run the custom code? [y/N]`, please type `y`.


#### Basic Parameter Descriptions
- `--model`: Specifies the `model_id` of the model on [ModelScope](https://modelscope.cn/), allowing automatic download. For example, see the [Qwen2-0.5B-Instruct model link](https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct/summary); you can also use a local path, such as `/path/to/model`.
- `--template-type`: Specifies the template type corresponding to the model. Refer to the `Default Template` field in the [template table](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-datasets.html#llm) for the value to use.
- `--datasets`: The dataset name, allowing multiple datasets to be specified, separated by spaces; these datasets will be automatically downloaded. Refer to the [supported datasets list](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html) for available options.
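
As a minimal sketch combining the three flags above, the command below evaluates a locally downloaded checkpoint (placeholder path) on two of the supported datasets:

```bash
# Illustrative invocation: local checkpoint, qwen template, two datasets
python -m evalscope.run \
 --model /path/to/model \
 --template-type qwen \
 --datasets arc gsm8k
```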

### 2. Parameterized Evaluation
If you wish to conduct a more customized evaluation, such as modifying model parameters or dataset parameters, you can use the following commands:

**Example 1:**
```shell
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --model-args revision=master,precision=torch.float16,device_map=auto \
 --datasets gsm8k ceval \
 --use-cache true \
 --limit 10
```

**Example 2:**
```shell
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --generation-config do_sample=false,temperature=0.0 \
 --datasets ceval \
 --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' \
 --limit 10
```

#### Parameter Descriptions
In addition to the three [basic parameters](#basic-parameter-descriptions), the other parameters are as follows:
- `--model-args`: Model loading parameters, separated by commas, in `key=value` format.
- `--generation-config`: Generation parameters, separated by commas, in `key=value` format.
  - `do_sample`: Whether to use sampling, default is `false`.
  - `max_new_tokens`: Maximum generation length, default is 1024.
  - `temperature`: Sampling temperature.
  - `top_p`: Nucleus sampling threshold; only tokens within the top cumulative probability `top_p` are considered.
  - `top_k`: Only the `top_k` highest-probability tokens are considered when sampling.
- `--use-cache`: Whether to use local cache, default is `false`. If set to `true`, previously evaluated model and dataset combinations will not be evaluated again, and will be read directly from the local cache.
- `--dataset-args`: Evaluation dataset configuration parameters, provided in JSON format, where the key is the dataset name and the value is the parameter; note that these must correspond one-to-one with the values in `--datasets`.
  - `few_shot_num`: Number of few-shot examples.
  - `few_shot_random`: Whether to randomly sample few-shot data; if not specified, defaults to `true`.
- `--limit`: Maximum number of evaluation samples per dataset; if not specified, all will be evaluated, which is useful for quick validation.
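
As a sketch of how `--datasets` and `--dataset-args` line up, the command below (few-shot values are illustrative) configures each dataset separately; every key in the JSON must match a name passed to `--datasets`:

```shell
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --datasets gsm8k ceval \
 --dataset-args '{"gsm8k": {"few_shot_num": 4}, "ceval": {"few_shot_num": 0, "few_shot_random": false}}' \
 --limit 5
```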

### 3. Use the run_task Function to Submit an Evaluation Task
The `run_task` function takes the same parameters as the command line, passed as a single dictionary that includes the following fields:

#### 1. Configuration Task Dictionary Parameters
```python
import torch
from evalscope.constants import DEFAULT_ROOT_CACHE_DIR

# Example
your_task_cfg = {
        'model_args': {'revision': None, 'precision': torch.float16, 'device_map': 'auto'},
        'generation_config': {'do_sample': False, 'repetition_penalty': 1.0, 'max_new_tokens': 512},
        'dataset_args': {},
        'dry_run': False,
        'model': 'qwen/Qwen2-0.5B-Instruct',
        'template_type': 'qwen',
        'datasets': ['arc', 'hellaswag'],
        'work_dir': DEFAULT_ROOT_CACHE_DIR,
        'outputs': DEFAULT_ROOT_CACHE_DIR,
        'mem_cache': False,
        'dataset_hub': 'ModelScope',
        'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
        'limit': 10,
        'debug': False
    }
```
Here, `DEFAULT_ROOT_CACHE_DIR` is set to `'~/.cache/evalscope'`.

#### 2. Execute Task with run_task
```python
from evalscope.run import run_task
run_task(task_cfg=your_task_cfg)
```


## Evaluation Backend
EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backends. The currently supported backends are listed below, followed by a minimal configuration sketch:
- **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
- [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [๐Ÿ“– User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): Initiate VLMEvalKit multimodal evaluation tasks through EvalScope. Supports various multimodal models and datasets, and offers seamless integration with the LLM fine-tuning framework ms-swift. [๐Ÿ“– User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/vlmevalkit_backend.html)
- **RAGEval**: Initiate RAG evaluation tasks through EvalScope, supporting independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html). [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/index.html)
- **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).
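
The exact configuration depends on the chosen backend; see the linked user guides. As a hedged sketch only, assuming the task dictionary accepts an `eval_backend` selector and a backend-specific `eval_config` block as described in the OpenCompass backend guide, a non-Native evaluation might be launched like this (field names and values are illustrative):

```python
from evalscope.run import run_task

# Hedged sketch: the 'eval_backend' / 'eval_config' fields and their contents follow
# the linked user guides and are illustrative, not a verbatim API reference.
task_cfg = {
    'eval_backend': 'OpenCompass',          # backend to delegate to (assumption)
    'eval_config': {                        # backend-specific settings (illustrative)
        'datasets': ['gsm8k'],              # dataset names understood by the backend
        'models': [
            {'path': 'qwen/Qwen2-0.5B-Instruct'},  # illustrative model entry
        ],
    },
}
run_task(task_cfg=task_cfg)
```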

## Custom Dataset Evaluation
EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html).

## Offline Evaluation
You can evaluate models with local datasets, without an internet connection.

Refer to: Offline Evaluation [๐Ÿ“– User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)
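
As a rough sketch of what an offline setup might look like with the task dictionary from the Quick Start (the `'Local'` hub value and the placeholder paths are assumptions; follow the user guide for the authoritative steps):

```python
from evalscope.run import run_task

# Hedged sketch: 'dataset_hub' and 'dataset_dir' appear in the Quick Start config example;
# the 'Local' value and the placeholder paths below are illustrative assumptions.
offline_task_cfg = {
    'model': '/path/to/model',                 # local checkpoint instead of a ModelScope model_id
    'template_type': 'qwen',
    'datasets': ['arc'],
    'dataset_hub': 'Local',                    # assumption: switch from 'ModelScope' to local files
    'dataset_dir': '/path/to/local/datasets',  # directory holding the pre-downloaded benchmarks
    'limit': 10,
}
run_task(task_cfg=offline_task_cfg)
```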


## Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles. The evaluation report can be produced either by the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or by manual evaluation.

Refer to: Arena Mode [๐Ÿ“– User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

## Model Serving Performance Evaluation
A stress-testing tool focused on large language model serving, which can be customized to support various dataset formats and different API protocols.

Refer to: Model Serving Performance Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test.html)



## TO-DO List
- [x] RAG evaluation
- [x] VLM evaluation
- [x] Agents evaluation
- [x] vLLM
- [ ] Distributed evaluation
- [x] Multi-modal evaluation
- [ ] Benchmarks
  - [ ] GAIA
  - [ ] GPQA
  - [x] MBPP
- [ ] Auto-reviewer
  - [ ] Qwen-max


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)

            
