llmuses

Name: llmuses
Version: 0.3.0
Home page: https://github.com/modelscope/eval-scope
Summary: Eval-Scope: Lightweight LLMs Evaluation Framework
Upload time: 2024-04-11 06:27:06
Author: ModelScope team
Requires Python: >=3.7
License: None
Keywords: python, llm, evaluation
Requirements: No requirements were recorded.
## Introduction
Large language model (LLM) evaluation has become an essential process for assessing and improving large models. To better support LLM evaluation, we propose the llmuses framework, which mainly consists of the following parts:
- Preset common benchmark datasets, including MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, HumanEval, and more
- Implementations of common evaluation metrics
- Unified model access, compatible with the generate and chat interfaces of multiple model families
- Automatic evaluation (evaluator):
    - Automatic evaluation of objective questions
    - Automatic evaluation of complex tasks using expert models
- Evaluation report generation
- Arena mode
- Visualization tools
- [Model performance evaluation](llmuses/perf/README.md)

Features
- Lightweight, minimizing unnecessary abstraction and configuration
- Easy to customize
  - Adding a new dataset only requires implementing a single class
  - Models can be hosted on [ModelScope](https://modelscope.cn); a model id is all that is needed to launch an evaluation
  - Supports locally deployed models
  - Visualized evaluation reports
- Rich evaluation metrics
- Model-based automatic evaluation pipeline supporting multiple evaluation modes
  - Single mode: an expert model scores a single candidate model
  - Pairwise-baseline mode: candidates are compared against a baseline model
  - Pairwise (all) mode: all models are compared pairwise
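
The three modes differ mainly in how many expert-model reviews they trigger per question. A back-of-the-envelope sketch of that cost (our own illustration, not part of the llmuses API):

```python
def review_counts(n_models: int) -> dict:
    """Number of expert-model reviews per question for each evaluation mode.

    Single mode scores each model once; pairwise-baseline compares every
    non-baseline model against the baseline; pairwise-all compares every
    unordered pair of models.
    """
    return {
        'single': n_models,
        'pairwise_baseline': n_models - 1,
        'pairwise_all': n_models * (n_models - 1) // 2,
    }

print(review_counts(4))  # {'single': 4, 'pairwise_baseline': 3, 'pairwise_all': 6}
```

This is why pairwise-baseline mode scales linearly as new models are added to a leaderboard, while pairwise-all grows quadratically.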


## Environment Setup
### Install with pip
We recommend using conda to manage the environment and pip to install dependencies:
1. Create a conda environment
```shell
conda create -n eval-scope python=3.10
conda activate eval-scope
```
2. Install dependencies
```shell
pip install llmuses
```

### Install from source
1. Download the source code
```shell
git clone https://github.com/modelscope/eval-scope.git
```
2. Install dependencies
```shell
cd eval-scope/
pip install -e .
```


## Quick Start

### Simple Evaluation
To evaluate a model on several specified datasets:
```shell
python llmuses/run.py --model ZhipuAI/chatglm3-6b --datasets mmlu ceval --limit 10
```
Here, the --model argument specifies the model's ModelScope model id; model link: [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary)

### Evaluation with Arguments
```shell
python llmuses/run.py --model ZhipuAI/chatglm3-6b --model-args revision=v1.0.2,precision=torch.float16,device_map=auto --datasets mmlu ceval --mem-cache --limit 10

python llmuses/run.py --model qwen/Qwen-1_8B --generation-config do_sample=false,temperature=0.0 --datasets ceval --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' --limit 10

# Argument notes
# --model-args: model arguments, comma-separated, in key=value form
# --datasets: dataset names; see the `Dataset List` section below
# --mem-cache: whether to use the in-memory cache; if enabled, already-evaluated data is cached automatically and persisted to local disk
# --limit: maximum number of samples evaluated per subset
# --dataset-args: per-dataset evaluation settings, passed as JSON; keys are dataset names and values are their settings, and they must correspond one-to-one with the values of the --datasets argument
#   -- few_shot_num: number of few-shot examples
#   -- few_shot_random: whether to sample few-shot data randomly; defaults to true if not set
```
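
Since `--dataset-args` takes a JSON string, it can be less error-prone to build it with `json.dumps` than to type it by hand (a small illustration; the keys mirror the `ceval` example above):

```python
import json

# Top-level keys must match the names passed to --datasets, one-to-one.
dataset_args = {
    "ceval": {"few_shot_num": 0, "few_shot_random": False},
}
# json.dumps renders Python False as JSON false, as the CLI expects.
arg_value = json.dumps(dataset_args)
print(arg_value)  # {"ceval": {"few_shot_num": 0, "few_shot_random": false}}
```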

### Using Local Datasets
Datasets are hosted on [ModelScope](https://modelscope.cn/datasets) by default, and loading them requires network access. In an offline environment, you can use local datasets as follows:
#### 1. Download the dataset locally
```shell
# Suppose the current local working directory is /path/to/workdir
wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/benchmark/data.zip
unzip data.zip
# The unzipped dataset then lives under /path/to/workdir/data; this directory will be passed as the value of the --dataset-dir argument in later steps
```
#### 2. Create an evaluation task with the local dataset
```shell
python llmuses/run.py --model ZhipuAI/chatglm3-6b --datasets arc --dataset-hub Local --dataset-dir /path/to/workdir/data --limit 10

# Argument notes
# --dataset-hub: dataset source; one of `ModelScope`, `Local`, `HuggingFace` (TO-DO); defaults to `ModelScope`
# --dataset-dir: when --dataset-hub is `Local`, this is the local dataset path; when --dataset-hub is `ModelScope` or `HuggingFace`, it is the dataset cache path
```
#### 3. (Optional) Load the model and evaluate offline
Model files are hosted on the ModelScope Hub and require network access to load. To create an evaluation task in an offline environment, follow these steps:
```shell
# 1. Prepare a local model folder; for the folder structure, see chatglm3-6b: https://modelscope.cn/models/ZhipuAI/chatglm3-6b/files
# For example, download the whole model folder to the local path /path/to/ZhipuAI/chatglm3-6b

# 2. Run the offline evaluation task
python llmuses/run.py --model /path/to/ZhipuAI/chatglm3-6b --datasets arc --dataset-hub Local --dataset-dir /path/to/workdir/data --limit 10
```

### Submitting Evaluation Tasks with the run_task Function
llmuses supports submitting tasks programmatically via imports, as follows:
#### 1. Install dependencies
```shell
# Refer to the `Environment Setup` section above and install the dependencies listed in requirements.txt

# Install the llmuses package
pip install https://sail-moe.oss-cn-hangzhou.aliyuncs.com/open_data/packages/llmuses-0.2.6-py3-none-any.whl
```

#### 2. Configure the task
```python
import torch
from llmuses.constants import DEFAULT_ROOT_CACHE_DIR

# Example
your_task_cfg = {
        'model_args': {'revision': None, 'precision': torch.float16, 'device_map': 'auto'},
        'generation_config': {'do_sample': False, 'repetition_penalty': 1.0, 'max_new_tokens': 512},
        'dataset_args': {},
        'dry_run': False,
        'model': 'ZhipuAI/chatglm3-6b',
        'datasets': ['arc', 'hellaswag'],
        'work_dir': DEFAULT_ROOT_CACHE_DIR,
        'outputs': DEFAULT_ROOT_CACHE_DIR,
        'mem_cache': False,
        'dataset_hub': 'ModelScope',
        'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
        'stage': 'all',
        'limit': 10,
        'debug': False
    }

```

#### 3. Run the task
```python
from llmuses.run import run_task

run_task(task_cfg=your_task_cfg)
```
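
Because the task configuration is a plain Python dict, sweeping the same evaluation settings over several models is just dict manipulation (a sketch of the dict handling only; `run_task` itself is called as shown above, and the model ids are illustrative):

```python
# Shared settings for every candidate model.
base_cfg = {
    'datasets': ['arc', 'hellaswag'],
    'limit': 10,
}
model_ids = ['ZhipuAI/chatglm3-6b', 'qwen/Qwen-1_8B']

# dict(base, key=value) copies the base, so base_cfg stays untouched.
task_cfgs = [dict(base_cfg, model=m) for m in model_ids]
print([cfg['model'] for cfg in task_cfgs])  # ['ZhipuAI/chatglm3-6b', 'qwen/Qwen-1_8B']
```

Each resulting dict can then be passed to `run_task(task_cfg=...)` in turn.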


### Arena Mode
Arena mode evaluates multiple candidate models through pairwise battles, using either the AI Enhanced Auto-Reviewer (AAR) automatic evaluation pipeline or human evaluation, and finally produces an evaluation report. An example workflow:
#### 1. Environment preparation
```text
a. Data preparation: for the questions data format, see llmuses/registry/data/question.jsonl
b. To use the automatic evaluation pipeline (AAR), configure the relevant environment variables. Taking the GPT-4-based auto-reviewer pipeline as an example, set:
> export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
```

#### 2. Configuration file
```text
For the arena evaluation configuration, see: llmuses/registry/config/cfg_arena.yaml
Field descriptions:
    questions_file: path to the question data
    answers_gen: generates candidate models' predictions; supports multiple models, each of which can be toggled via its enable flag
    reviews_gen: generates review results; GPT-4 is currently the default Auto-reviewer; this step can be toggled via its enable flag
    elo_rating: the Elo rating algorithm; can be toggled via its enable flag; note that this step requires review_file to exist
```
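
The `elo_rating` step refers to the standard Elo update rule; a minimal sketch of that rule (our own illustration; the K-factor, initialization, and tie handling in llmuses may differ):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a single battle between models A and B.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    Returns the updated (rating_a, rating_b); total rating is conserved.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models: the winner gains exactly k/2 points.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```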

#### 3. Run the script
```shell
# Usage:
cd llmuses

# dry-run mode (model answers are generated normally, but the expert model is not invoked and review results are generated randomly)
python llmuses/run_arena.py -c registry/config/cfg_arena.yaml --dry-run

# Run the evaluation pipeline
python llmuses/run_arena.py -c registry/config/cfg_arena.yaml
```

#### 4. Visualize the results

```shell
# Usage:
streamlit run viz.py -- --review-file llmuses/registry/data/qa_browser/battle.jsonl --category-file llmuses/registry/data/qa_browser/category_mapping.yaml
```


### Single-Model Scoring Mode (Single mode)

In this mode, only a single model's output is scored; no pairwise comparison is performed.
#### 1. Configuration file
```text
For the evaluation configuration, see: llmuses/registry/config/cfg_single.yaml
Field descriptions:
    questions_file: path to the question data
    answers_gen: generates candidate models' predictions; supports multiple models, each of which can be toggled via its enable flag
    reviews_gen: generates review results; GPT-4 is currently the default Auto-reviewer; this step can be toggled via its enable flag
    rating_gen: the rating algorithm; can be toggled via its enable flag; note that this step requires review_file to exist
```
#### 2. Run the script
```shell
# Example:
python llmuses/run_arena.py -c registry/config/cfg_single.yaml
```

### Baseline Comparison Mode (Pairwise-baseline mode)

In this mode, a baseline model is selected, and all other models are scored by comparison against it. This makes it easy to add a new model to the leaderboard: only the battles between the new model and the baseline need to be scored.
#### 1. Configuration file
```text
For the evaluation configuration, see: llmuses/registry/config/cfg_pairwise_baseline.yaml
Field descriptions:
    questions_file: path to the question data
    answers_gen: generates candidate models' predictions; supports multiple models, each of which can be toggled via its enable flag
    reviews_gen: generates review results; GPT-4 is currently the default Auto-reviewer; this step can be toggled via its enable flag
    rating_gen: the rating algorithm; can be toggled via its enable flag; note that this step requires review_file to exist
```
#### 2. Run the script
```shell
# Example:
python llmuses/run_arena.py -c llmuses/registry/config/cfg_pairwise_baseline.yaml
```


## Dataset List

| DatasetName        | Link                                                                                   | Status | Note |
|--------------------|----------------------------------------------------------------------------------------|--------|------|
| `mmlu`             | [mmlu](https://modelscope.cn/datasets/modelscope/mmlu/summary)                         | Active |      |
| `ceval`            | [ceval](https://modelscope.cn/datasets/modelscope/ceval-exam/summary)                  | Active |      |
| `gsm8k`            | [gsm8k](https://modelscope.cn/datasets/modelscope/gsm8k/summary)                       | Active |      |
| `arc`              | [arc](https://modelscope.cn/datasets/modelscope/ai2_arc/summary)                       | Active |      |
| `hellaswag`        | [hellaswag](https://modelscope.cn/datasets/modelscope/hellaswag/summary)               | Active |      |
| `truthful_qa`      | [truthful_qa](https://modelscope.cn/datasets/modelscope/truthful_qa/summary)           | Active |      |
| `competition_math` | [competition_math](https://modelscope.cn/datasets/modelscope/competition_math/summary) | Active |      |
| `humaneval`        | [humaneval](https://modelscope.cn/datasets/modelscope/humaneval/summary)               | Active |      |
| `bbh`              | [bbh](https://modelscope.cn/datasets/modelscope/bbh/summary)                           | Active |      |
| `race`             | [race](https://modelscope.cn/datasets/modelscope/race/summary)                         | Active |      |
| `trivia_qa`        | [trivia_qa](https://modelscope.cn/datasets/modelscope/trivia_qa/summary)               | To be integrated |      |


## Leaderboard
The ModelScope LLM Leaderboard aims to provide an objective and comprehensive evaluation standard and platform, helping researchers and developers understand and compare the performance of models hosted on ModelScope across a variety of tasks.

[Leaderboard](https://modelscope.cn/leaderboard/58/ranking?type=free)



## Experiments and Reports
See: [Experiments](./resources/experiments.md)

## TO-DO List
- [ ] Agents evaluation
- [ ] vLLM
- [ ] Distributed evaluating
- [ ] Multi-modal evaluation
- [ ] Benchmarks
  - [ ] GAIA
  - [ ] GPQA
  - [ ] MBPP
- [ ] Auto-reviewer
  - [ ] Qwen-max

            
