# `EvalPlus(📖) => 📚`
<p align="center">
<a href="https://evalplus.github.io/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a>
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a>
<a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a>
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg"></a>
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a>
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a>
</p>
<p align="center">
<a href="#-news">📰News</a> •
<a href="#-quick-start">🔥Quick Start</a> •
<a href="#-llm-backends">🚀LLM Backends</a> •
<a href="#-documents">📚Documents</a> •
<a href="#-citation">📜Citation</a> •
<a href="#-acknowledgement">🙏Acknowledgement</a>
</p>
## About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ **HumanEval+**: 80x more tests than the original HumanEval!
- ✨ **MBPP+**: 35x more tests than the original MBPP!
- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!
- ✨ **Framework**: our packages/images/tools let you easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ **Precise evaluation & ranking**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for the latest LLM rankings before & after rigorous evaluation.
- ✨ **Coding rigorousness**: Look at the score differences before and after applying the EvalPlus tests! A smaller drop is better: it means the generated code is genuinely robust, while a big drop means it tends to be fragile.
- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
Want to know more details? Read our papers & materials!
- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Google Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf)
- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md)
## 📰 News
Notable updates to EvalPlus:
- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Release highlights include (i) code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipeline (generation + post-processing + evaluation), and (iii) support for more inference backends such as Google Gemini & Anthropic.
- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena).
- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks); expect a ~4pp pass@1 improvement.
- **Earlier**:
- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6)
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140)
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
- ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!
## 🔥 Quick Start
- Code correctness evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
```
<details><summary>Code execution within Docker <i>:: click to expand ::</i></summary>
<div>
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy
# Code execution within Docker
docker run --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
</div>
</details>
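If you already have model generations, you can skip the generation step and point the evaluator at an existing sample file via `--samples` (the path below is illustrative):

```bash
# Evaluate pre-generated samples (JSONL path is an example placeholder)
evalplus.evaluate --dataset humaneval \
                  --samples path/to/your_samples.jsonl
```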
- Code efficiency evaluation: EvalPerf (*nix only)
```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--backend vllm
```
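The `perf_event_paranoid` command above relaxes the kernel's restriction on hardware performance counters, which EvalPerf needs for profiling. A quick sanity check, in plain shell with no EvalPlus-specific assumptions:

```bash
# Values > 0 restrict unprivileged access to perf events; 0 (or -1) is permissive
cat /proc/sys/kernel/perf_event_paranoid

# The echo above does not persist across reboots; to make it permanent,
# add `kernel.perf_event_paranoid = 0` to /etc/sysctl.conf or /etc/sysctl.d/
```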
<details><summary>Code execution within Docker <i>:: click to expand ::</i></summary>
<div>
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100
# Code execution within Docker
docker run --cap-add PERFMON --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```
</div>
</details>
## 🚀 LLM Backends
### HuggingFace models
- `transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
```
> [!NOTE]
>
> EvalPlus uses different prompts for base and chat models.
> By default, the model type is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends.
> Other backends only support chat mode.
>
> Therefore, if your base model ships with a `tokenizer.chat_template`,
> add `--force-base-prompt` to avoid evaluating it in chat mode.
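For example, forcing the base-model prompt on a model whose tokenizer ships a chat template (reusing the model from above):

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset humaneval \
                  --backend hf \
                  --force-base-prompt \
                  --greedy
```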
<details><summary>Enable Flash Attention 2 <i>:: click to expand ::</i></summary>
<div>
```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
```
</div>
</details>
- `vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
```
- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)):
```bash
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
```
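For a local server, one option is vLLM's OpenAI-compatible entrypoint. This is a sketch; the exact entrypoint and flags vary across vLLM versions, so consult the vLLM docs linked above:

```bash
# Serve the model with an OpenAI-compatible API on port 8000 (vLLM-version dependent)
python -m vllm.entrypoints.openai.api_server \
    --model "ise-uiuc/Magicoder-S-DS-6.7B" \
    --port 8000
```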
### OpenAI models
- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)
```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
```
### Anthropic models
- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/)
```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
```
### Google Gemini models
- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/)
```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
```
You can check out the generations and results under `evalplus_results/[humaneval|mbpp]/`.
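For a quick look at the outputs (file names vary by model and sampling settings, so the ones below are placeholders):

```bash
ls evalplus_results/humaneval/
# Pretty-print an evaluation result file (placeholder name)
python -m json.tool evalplus_results/humaneval/YOUR_MODEL.eval_results.json
```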
<details><summary>⏬ Using EvalPlus as a local repo? <i>:: click to expand ::</i></summary>
<div>
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```
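When running from a local checkout this way, the `evalplus.*` console scripts may not be on your `PATH`; invoking the modules directly should be equivalent (an assumption worth checking against [docs/cli.md](./docs/cli.md)):

```bash
# Hypothetical module invocation mirroring the `evalplus.evaluate` console script
python -m evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                            --dataset humaneval \
                            --backend vllm \
                            --greedy
```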
</div>
</details>
## 📚 Documents
To learn more about how to use EvalPlus, please refer to:
- [Command Line Interface](./docs/cli.md)
- [EvalPerf](./docs/evalperf.md)
- [Program Execution](./docs/execution.md)
## 📜 Citation
```bibtex
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}
@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```
## 🙏 Acknowledgement
- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)