easy-evaluator

Name: easy-evaluator
Version: 0.0.0
Home page: https://github.com/Anindyadeep/easy_eval
Summary: A library for easy evaluation of language models
Upload time: 2024-03-03 15:30:15
Author: Anindyadeep
Keywords: llm, evaluation, openai

# EasyEval

EasyEval is a fully open-source evaluation wrapper that streamlines the integration, customization, and extension of robust evaluation engines like [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [bigcode-eval-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) into existing production-grade or research pipelines. It supports over 200 existing datasets and can be easily adapted for custom ones, making it a versatile solution for building evaluation workflows.

### But Why?

Evaluation has been an open problem for LLMs. When putting LLMs into production, we need to rely on different evaluation techniques. However, a problem we often face is integrating good evaluation engines into existing production LLM pipelines.

So what are the options?

1. Either go for an enterprise solution.
2. Or look for Open Source solutions. 

There are a handful of open-source libraries that run evaluations on large-scale evaluation datasets. Some examples are:

1. [LM Evaluation Harness by Eleuther AI](https://github.com/EleutherAI/lm-evaluation-harness)
2. [BigCode Evaluation Harness by the BigCode Project](https://github.com/bigcode-project/bigcode-evaluation-harness)
3. [Stanford HELM](https://crfm.stanford.edu/helm/lite/latest/)
4. [OpenCompass](https://opencompass.org.cn/home)

Beyond these, there are many more evaluation libraries, a large fraction of which are extensions of the engines above. Each of these engines defines its own taxonomy of how it evaluates.

For example, LM Evaluation Harness by Eleuther AI defines different tasks, and under each task there are different datasets. The "test/evaluation" split of each dataset is used to evaluate the LLM of choice.
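
As a purely illustrative sketch of that hierarchy (the task and dataset names below are examples of the kind of structure a harness defines, not its actual task registry):

```python
# Illustrative only: a harness groups datasets under named tasks and evaluates
# the model on a held-out split of each dataset. These names are examples,
# not the actual lm-eval-harness registry.
task_taxonomy = {
    "question_answering": {"datasets": ["babi"], "split": "test"},
    "commonsense_reasoning": {"datasets": ["hellaswag"], "split": "validation"},
}

for task, spec in task_taxonomy.items():
    print(f"{task}: datasets={spec['datasets']}, evaluated on the '{spec['split']}' split")
```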

### The problem

The problem with these evaluators is that most of them are CLI-first. They expose very little documentation of their actual API interfaces. These libraries become much more useful if they can be easily integrated, extended, or customized with newer tasks in existing production pipelines, such as:

1. Building evaluation REST API servers (see the sketch after this list).
2. CI/CD pipelines that evaluate LLM fine-tuning runs.
3. Leaderboard generation to compare checkpoints or different LLMs.
4. Supporting any custom model or engine, for example TensorRT or an arbitrary API endpoint.
5. Using GPT as an evaluator.

And many more like these.
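
For instance, here is a minimal, hedged sketch of the first use case: serving evaluations over a REST API. It assumes `fastapi`, `uvicorn`, and `pydantic` are installed; the route name and request payload are illustrative, and the evaluator calls follow the Usage section later in this README.

```python
# Hedged sketch of use case 1: an evaluation REST API server.
# Assumes fastapi/uvicorn/pydantic are installed; the route and payload shape
# are illustrative, and the evaluator API mirrors the Usage section below.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from easy_eval import HarnessEvaluator
from easy_eval.config import EvaluatorConfig

app = FastAPI()


class EvalRequest(BaseModel):
    name_or_path: str      # model name (HF repo) or local path
    tasks: List[str]       # lm-eval-harness task names, e.g. ["babi"]
    limit: int = 10        # number of datapoints per task


@app.post("/evaluate")
def run_evaluation(req: EvalRequest):
    # Build an evaluator per request (same calls as in the Usage section).
    harness = HarnessEvaluator(
        model_name_or_path=req.name_or_path,
        model_backend="huggingface",
        device="cpu",
    )
    config = EvaluatorConfig(limit=req.limit)
    # harness.evaluate returns the results as JSON (see below).
    return harness.evaluate(tasks=req.tasks, config=config)
```

If the file is saved as `server.py`, it could be served with `uvicorn server:app`.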

## The Objective of the Library

This library acts as a wrapper that combines both engines, lm-eval-harness (mostly evaluation datasets across different general tasks) and bigcode-eval-harness (evaluation datasets exclusively for code-generation tasks), behind common interfaces. The features of the library include:

1. Adding a common interface between the two libraries for handling evaluation workloads. 
2. Providing interfaces to solve the above problems. 
3. Customization of models and addition of new benchmark datasets.

## Getting Started and Usage

Let's start by installing the library. Open a terminal, create a new virtual environment, and install easyeval:

```bash
pip install easyeval
```

### Usage

The very first version includes a simple interface for interacting with the lm-eval-harness engine. Here is how you can use it:

```python
from easy_eval import HarnessEvaluator
from easy_eval.config import EvaluatorConfig
```

`EvaluatorConfig` is where you provide your model's generation configuration. You can check out all the config options [here](/easy_eval/config.py). After this, we instantiate our evaluator:

```python
harness = HarnessEvaluator(model_name_or_path="gpt2", model_backend="huggingface", device="cpu")

# device accepts "cpu" or "cuda", following the standard convention for specifying devices.
```

`HarnessEvaluator` expects you to provide the `model_backend`. Here are some supported backends:

1. [HuggingFace](https://huggingface.co/)
2. [vLLM](https://github.com/vllm-project/vllm)
3. [Anthropic](https://www.anthropic.com/)
4. [OpenAI](https://platform.openai.com/docs/introduction)
5. [OpenVINO](https://github.com/openvinotoolkit/openvino)
6. [GGML/GGUF](https://github.com/ggerganov/ggml)
7. [Mamba](https://github.com/state-spaces/mamba)

It also expects `model_name_or_path`, which is the name of the model (if it is a Hugging Face repo) or the path to the model understood by the corresponding `model_backend`.
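
For example, here is a hedged sketch of pointing the evaluator at a different backend. Only the `"huggingface"` backend string appears in this README, so the `"vllm"` identifier and the model name below are assumptions used for illustration:

```python
# Hedged sketch: swapping the backend. Only "huggingface" is shown in this README,
# so the "vllm" backend string and the model name are assumptions / illustrations.
vllm_harness = HarnessEvaluator(
    model_name_or_path="mistralai/Mistral-7B-v0.1",  # illustrative model identifier
    model_backend="vllm",
    device="cuda",
)
```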

Once we have instantiated our evaluator, we define our config. Defining a config is fully optional; if we do not pass one, the default values will be chosen.

```python
config = EvaluatorConfig(
    limit=10 # the number of datapoints to take for evaluation
)
```

Now we get our evaluation results by passing the config and the list of evaluation tasks we want our model to be evaluated on:

```python
results = harness.evaluate(
    tasks=["babi"],
    config=config, show_results_terminal=True
)

print(results)
```

This returns the `results` in JSON format.
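
Since the results come back as JSON, here is a minimal sketch for persisting them to disk, assuming `results` is either a JSON string or a JSON-serializable dict (the exact schema comes from the underlying harness run):

```python
import json

# Hedged sketch: save the evaluation results for later comparison across runs.
# We assume `results` is a JSON string or a JSON-serializable dict; the exact
# schema is produced by the underlying lm-eval-harness run.
parsed = json.loads(results) if isinstance(results, str) else results

with open("harness_results.json", "w") as f:
    json.dump(parsed, f, indent=2)
```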

## Contributing

`easyeval` is at a very early stage right now. You can check out the [roadmap](https://github.com/Anindyadeep/easy_eval/issues/2) to see the features expected to come in the future.

This is a fully open-source project, so contributions are highly appreciated. Here is how you can contribute:

1. Open issues to suggest improvements or features.
2. Contribute to existing issues or fix bugs by opening a pull request.


## Reference and Citations 

```
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}
```

```
@misc{bigcode-evaluation-harness,
  author       = {Ben Allal, Loubna and
                  Muennighoff, Niklas and
                  Kumar Umapathi, Logesh and
                  Lipkin, Ben and
                  von Werra, Leandro},
  title = {A framework for the evaluation of code generation models},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},
  year = 2022,
}
```


            
