# eval

Python Library for Evaluation
## What is Evaluation?
Evaluation allows us to assess how a given model performs against a set of specific tasks. This is done by running a set of standardized benchmark tests against
the model. Running evaluation produces numerical scores across these benchmarks, as well as logs of excerpts/samples of the outputs the model produced during them.
Using these artifacts as reference, along with a manual smoke test, gives us the best idea of whether or not a model has learned
and improved on something we are trying to teach it. There are two stages of model evaluation in the InstructLab process:
### Inter-checkpoint Evaluation
This step occurs during multi-phase training. Each phase of training produces multiple different “checkpoints” of the model that are taken at various stages during
the phase. At the end of each phase, we evaluate all the checkpoints in order to find the one that provides the best results. This is done as part of the
[InstructLab Training](https://github.com/instructlab/training) library.
### Full-scale final Evaluation
Once training is complete and we have picked the best checkpoint from the output of the final phase, we can run the full-scale evaluation suite, which runs MT-Bench, MMLU,
MT-Bench Branch, and MMLU Branch.
## Methods of Evaluation
Below are more in-depth explanations of the suite of benchmarks we use to evaluate models.
### Multi-turn benchmark (MT-Bench)
**tl;dr** Full model evaluation of performance on **skills**
MT-Bench is a benchmark that involves asking a model 80 multi-turn questions, i.e.
```text
<Question 1> → <model’s answer 1> → <Follow-up question> → <model’s answer 2>
```
A “judge” model reviews each multi-turn question and the candidate model's answers, and rates each answer with a score out of 10. The scores are then averaged,
and the final score produced is the “MT-Bench score” for that model. This benchmark assumes no factual knowledge on the model’s part. The questions are static but do not become obsolete with time.
You can read more about MT-Bench [here](https://arxiv.org/abs/2306.05685)
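The averaging described above can be sketched as follows. Note this is an illustrative sketch of the scoring idea, not the library's actual API; the rating values are made up.

```python
# Sketch of MT-Bench-style scoring: average the judge's per-turn ratings
# (each out of 10) across all questions to get a single model score.
def mt_bench_score(ratings):
    """Average a flat list of judge ratings, each a score out of 10."""
    if not ratings:
        raise ValueError("no ratings to average")
    return sum(ratings) / len(ratings)

# Example: judge ratings for two 2-turn questions (illustrative values)
ratings = [8.0, 7.0, 9.0, 6.0]
print(mt_bench_score(ratings))  # 7.5
```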
### MT-Bench Branch
MT-Bench Branch is an adaptation of MT-Bench that is designed to test custom skills that are added to the model with the InstructLab project. These new skills
come in the form of question/answer pairs in a Git branch of the [taxonomy](https://github.com/instructlab/taxonomy).
MT-Bench Branch has the candidate model generate answers to the user-supplied seed questions; the judge model then scores those answers using the user-supplied
seed answers as a reference.
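The flow above can be sketched as pairing each seed question with the candidate model's answer and the seed reference answer, producing the records a judge model would score. The data structures here are illustrative, not the library's actual API.

```python
# Illustrative sketch of the MT-Bench Branch judging flow: combine
# user-supplied seed question/answer pairs with candidate model answers
# into records for a judge model to score.
def build_judge_inputs(seed_qa, candidate_answers):
    """seed_qa: list of {"question": ..., "reference": ...} dicts.
    candidate_answers: dict mapping question text -> model answer."""
    records = []
    for item in seed_qa:
        records.append({
            "question": item["question"],
            "reference": item["reference"],
            "candidate": candidate_answers.get(item["question"]),
        })
    return records

# Example with made-up data
seeds = [{"question": "Q1", "reference": "expected A1"}]
answers = {"Q1": "model A1"}
print(build_judge_inputs(seeds, answers))
```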
### Massive Multitask Language Understanding (MMLU)
**tl;dr** Full model evaluation of performance on **knowledge**
MMLU is a benchmark consisting of a series of fact-based multiple-choice questions, each with 4 answer options. It tests whether a model is able to interpret
each question and its candidate answers correctly, formulate its own answer, and then select the correct option out of the provided ones. The questions are organized into a set
of 57 “tasks”, each with a given domain. The domains cover a number of topics ranging from Chemistry and Biology to US History and Math.
The model's selections are then compared against the set of known correct answers for each question to determine how many the model got right. The final MMLU score is the
average of its per-task scores. This benchmark does not involve any reference/critic model and is completely objective. It does assume factual knowledge
on the model’s part. The questions are static, so MMLU cannot be used to gauge the model’s knowledge of more recent topics.
InstructLab uses an implementation found [here](https://github.com/EleutherAI/lm-evaluation-harness) for running MMLU.
You can read more about MMLU [here](https://arxiv.org/abs/2009.03300)
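The scoring described above can be sketched as computing accuracy per task and then averaging across tasks. This is a minimal illustration of the scheme, not the lm-evaluation-harness implementation, and the data is made up.

```python
# Sketch of MMLU-style scoring: each task is a list of multiple-choice
# questions; compute per-task accuracy, then average across tasks.
def task_accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def mmlu_score(per_task_results):
    """per_task_results: list of (predictions, answers) pairs, one per task."""
    accuracies = [task_accuracy(p, a) for p, a in per_task_results]
    return sum(accuracies) / len(accuracies)

# Example with made-up tasks
results = [
    (["A", "C", "B"], ["A", "B", "B"]),  # task 1: 2/3 correct
    (["D", "D"], ["D", "D"]),            # task 2: 2/2 correct
]
print(mmlu_score(results))  # ~0.833
```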
### MMLU Branch
MMLU Branch is an adaptation of MMLU that is designed to test custom knowledge that is being added to the model via a Git branch of the [taxonomy](https://github.com/instructlab/taxonomy).
A teacher model is used to generate new multiple-choice questions based on the knowledge document included in the taxonomy Git branch. A “task” is then constructed that references the newly generated answer choices. These tasks are then used to score the model’s grasp of the new knowledge in the same way MMLU works. Generation of these tasks is done as part of the [InstructLab SDG](https://github.com/instructlab/sdg) library.
## MT-Bench / MT-Bench Branch Testing Steps
> **⚠️ Note:** Must use Python version 3.10 or later.
```shell
# Optional: Use cloud-instance.sh (https://github.com/instructlab/instructlab/tree/main/scripts/infra) to launch and setup the instance
scripts/infra/cloud-instance.sh ec2 launch -t g5.4xlarge
scripts/infra/cloud-instance.sh ec2 setup-rh-devenv
scripts/infra/cloud-instance.sh ec2 install-rh-nvidia-drivers
scripts/infra/cloud-instance.sh ec2 ssh sudo reboot
scripts/infra/cloud-instance.sh ec2 ssh
# Regardless of how you set up your instance
git clone https://github.com/instructlab/taxonomy.git && pushd taxonomy && git branch rc && popd
git clone --bare https://github.com/instructlab/eval.git && git clone eval.git/ && cd eval && git remote add syncrepo ../eval.git
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
pip install vllm
python -m vllm.entrypoints.openai.api_server --model instructlab/granite-7b-lab --tensor-parallel-size 1
```
In another shell window
```shell
export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10 # Optional if you want to shorten run times
# Commands relative to eval directory
python3 tests/test_gen_answers.py
python3 tests/test_branch_gen_answers.py
```
Example output tree
```shell
eval_output/
├── mt_bench
│   └── model_answer
│       └── instructlab
│           └── granite-7b-lab.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl
```
```shell
python3 tests/test_judge_answers.py
python3 tests/test_branch_judge_answers.py
```
Example output tree
```shell
eval_output/
├── mt_bench
│   ├── model_answer
│   │   └── instructlab
│   │       └── granite-7b-lab.jsonl
│   └── model_judgment
│       └── instructlab
│           └── granite-7b-lab_single.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── model_judgment
    │   │   └── instructlab
    │   │       └── granite-7b-lab_single.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── model_judgment
        │   └── instructlab
        │       └── granite-7b-lab_single.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl
```
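The `.jsonl` files above hold one JSON object per line. A quick way to inspect a judgment file and average its scores is sketched below; the `"score"` field name is an assumption for illustration, not a documented schema.

```python
# Sketch: read a judgment .jsonl file (one JSON object per line) and
# average its scores. The "score" key is assumed, not a documented schema.
import json

def average_score(path, score_key="score"):
    """Return the mean of `score_key` across lines, or None if absent."""
    scores = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if score_key in record:
                scores.append(record[score_key])
    return sum(scores) / len(scores) if scores else None
```

For example, `average_score("eval_output/mt_bench/model_judgment/instructlab/granite-7b-lab_single.jsonl")` would summarize the MT-Bench judgments in one number, assuming each line carries a numeric `"score"` field.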