bigcodebench

Name: bigcodebench
Version: 0.1.7
Summary: Evaluation package for BigCodeBench
Home page: https://github.com/bigcode-project/bigcodebench
Upload time: 2024-06-27 23:39:47
Requires Python: >=3.8
License: Apache-2.0
# BigCodeBench
<center>
<img src="https://github.com/bigcode-bench/bigcode-bench.github.io/blob/main/asset/bigcodebench_banner.svg?raw=true" alt="BigCodeBench">
</center>

<p align="center">
    <a href="https://pypi.org/project/bigcodebench/"><img src="https://img.shields.io/pypi/v/bigcodebench?color=g"></a>
    <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-evaluate" title="Docker-Eval"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-evaluate"></a>
    <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-generate" title="Docker-Gen"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-generate"></a>
    <a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
</p>

<p align="center">
    <a href="#-about">🌸About</a> β€’
    <a href="#-quick-start">πŸ”₯Quick Start</a> β€’
    <a href="#-llm-generated-code">πŸ’»LLM code</a> β€’
    <a href="#-failure-inspection">πŸ”Failure inspection</a> β€’
    <a href="#-full-script">πŸš€Full Script</a> β€’
    <a href="#-result-analysis">πŸ“ŠResult Analysis</a> β€’
    <a href="#-known-issues">🐞Known issues</a> β€’
    <a href="#-citation">πŸ“œCitation</a> β€’
    <a href="#-acknowledgement">πŸ™Acknowledgement</a>
</p>

## 🌸 About

### BigCodeBench

BigCodeBench is an **_easy-to-use_** benchmark for code generation with **_practical_** and **_challenging_** programming tasks. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.
To facilitate the evaluation of LLMs on BigCodeBench, we provide this Python package `bigcodebench` that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the [EvalPlus](https://github.com/evalplus/evalplus) framework, which is a flexible and extensible evaluation framework for code generation tasks.

### Why BigCodeBench?

BigCodeBench focuses on evaluating LLM4Code with *diverse function calls* and *complex instructions*, offering:

* ✨ **Precise evaluation & ranking**: See [our leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) for the latest LLM rankings before & after rigorous evaluation.
* ✨ **Pre-generated samples**: BigCodeBench accelerates code intelligence research by open-sourcing [LLM-generated samples](#-LLM-generated-code) for various models -- no need to re-run the expensive benchmarks!

### Main Differences from EvalPlus

We inherit the design of the EvalPlus framework, a flexible and extensible evaluation framework for code generation tasks. However, BigCodeBench differs in the following ways:
* Execution Environment: The execution environment in BigCodeBench is less restricted than in EvalPlus, in order to support tasks with diverse library dependencies.
* Test Evaluation: BigCodeBench relies on `unittest` for evaluating the generated code, which is better suited to the BigCodeBench test harness.

## 🔥 Quick Start

> [!Tip]
>
> BigCodeBench ❀️ [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)!
> BigCodeBench will be integrated into bigcode-evaluation-harness, so you can also run it there!

To get started, please first set up the environment:

```bash
# Install to use bigcodebench.evaluate
pip install bigcodebench --upgrade
# If you want to run the evaluation locally, you need to install the requirements
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt

# Install to use bigcodebench.generate
# We strongly recommend installing the generation dependencies in a separate environment
pip install bigcodebench[generate] --upgrade
```
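For example, a minimal sketch of such a separate environment using `venv` (the environment name is just an illustration):

```bash
# Hypothetical setup: keep the generation dependencies in their own virtual environment
python -m venv bcb-generate        # environment name is arbitrary
source bcb-generate/bin/activate
pip install "bigcodebench[generate]" --upgrade
```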

<details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary>
<div>

```bash
# Install to use bigcodebench.evaluate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
```

</div>
</details>

<details><summary>⏬ Using BigCodeBench as a local repo? <i>:: click to expand ::</i></summary>
<div>

```bash
git clone https://github.com/bigcode-project/bigcodebench.git
cd bigcodebench
export PYTHONPATH=$PYTHONPATH:$(pwd)
# Install to use bigcodebench.evaluate
pip install -e .
# Install to use bigcodebench.generate
pip install -e .[generate]
```

</div>
</details>

### Code Generation

We suggest using `flash-attn` for generating code samples.
```bash
pip install -U flash-attn
```

To generate code samples from a model, you can use the following command:
```bash
bigcodebench.generate \
    --model [model_name] \
    --subset [complete|instruct] \
    --greedy \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|hf|openai|mistral|anthropic|google] \
    --tp [gpu_number] \
    [--trust_remote_code] \
    [--base_url [base_url]]
```
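For instance, a filled-in invocation for greedy, single-sample generation might look like the following; the model name, backend, and GPU count are illustrative placeholders rather than recommendations:

```bash
# Hypothetical example: greedy decoding on the instruct subset with vLLM on one GPU
bigcodebench.generate \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --subset instruct \
    --greedy \
    --bs 1 \
    --temperature 0.0 \
    --n_samples 1 \
    --resume \
    --backend vllm \
    --tp 1
```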
The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
```bash
# If you are using GPUs
docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
    --model [model_name] \
    --subset [complete|instruct] \
    [--greedy] \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|hf|openai|mistral|anthropic|google] \
    --tp [gpu_number]

# ...Or if you are using CPUs
docker run -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
    --model [model_name] \
    --subset [complete|instruct] \
    [--greedy] \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|hf|openai|mistral|anthropic|google]
```
```bash
# If you wish to use gated or private HuggingFace models and datasets
docker run -e HUGGING_FACE_HUB_TOKEN=$token -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments

# Similarly, to use other backends that require authentication
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
docker run -e GOOGLE_API_KEY=$GOOGLE_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
docker run -e ANTHROPIC_KEY=$ANTHROPIC_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```
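Putting the pieces together, a complete docker-based generation command for a gated Hugging Face model could look like the following sketch (model name and device index are placeholders):

```bash
# Hypothetical end-to-end example: docker generation with a gated Hugging Face model on GPU 0
docker run --gpus '"device=0"' \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --subset complete \
    --greedy \
    --bs 1 \
    --temperature 0.0 \
    --n_samples 1 \
    --resume \
    --backend vllm \
    --tp 1
```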
You can then run the built container as shown above.
<details><summary>🤔 Structure of `problem`? <i>:: click to expand ::</i></summary>
<div>

* `task_id` is the identifier string for the task
* `entry_point` is the name of the function
* `complete_prompt` is the prompt for BigCodeBench-Complete
* `instruct_prompt` is the prompt for BigCodeBench-Instruct
* `canonical_solution` is the ground-truth implementation
* `test` is the `unittest.TestCase` class

</div>
</details>

> [!Note]
>
> **Expected Schema of `[model_name]--bigcodebench-[task]--[backend]-[temp]-[n_samples].jsonl`**
>
> 1. `task_id`: Task ID; the task IDs are the keys of `get_bigcodebench()`
> 2. `solution` (optional): Self-contained solution (usually including the prompt)
>    * Example: `{"task_id": "BigCodeBench/?", "solution": "def f():\n    return 1"}`
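As a quick, optional sanity check against this schema, you can peek at the generated JSONL, for example with `jq` (the file name below is a placeholder):

```bash
# Hypothetical sanity check (requires jq): confirm each record parses and print the first few task IDs
jq -r '.task_id' samples.jsonl | head -n 3
```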

### Code Post-processing

LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete extra code.
We provide a tool named `bigcodebench.sanitize` to clean up the code:

```bash
# 💡 If you want to get the calibrated results:
bigcodebench.sanitize --samples samples.jsonl --calibrate
# Sanitized code will be produced to `samples-sanitized-calibrated.jsonl`

# 💡 If you want to get the original results:
bigcodebench.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`

# 💡 If you are storing code in directories:
bigcodebench.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```

If you want to use the pre-built docker images for post-processing, you can use the following command:

```bash
# Change the entrypoint to bigcodebench.sanitize in any pre-built docker image, like bigcodebench/bigcodebench-evaluate:latest
docker run -it --entrypoint bigcodebench.sanitize -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --samples samples.jsonl
```

<details><summary>🔎 Checking the compatibility of post-processed code <i>:: click to expand ::</i></summary>
<div>

To double-check the post-processing results, you can use `bigcodebench.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:

```bash
# 💡 If you are storing code in jsonl:
bigcodebench.syncheck --samples samples.jsonl

# 💡 If you are storing code in directories:
bigcodebench.syncheck --samples /path/to/vicuna-[??]b_temp_[??]

# 💡 Or change the entrypoint to bigcodebench.syncheck in any pre-built docker image, e.g. bigcodebench/bigcodebench-evaluate:latest:
docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --samples samples.jsonl
```

</div>
</details>


### Code Evaluation

We strongly recommend using a sandbox such as [Docker](https://docs.docker.com/get-docker/):

```bash
# Mount the current directory to the container
# If you want to change the RAM address space limit (in MB, 128 GB by default): `--max-as-limit XXX`
# If you want to change the RAM data segment limit (in MB, 4 GB by default): `--max-data-limit XXX`
# If you want to change the RAM stack limit (in MB, 4 MB by default): `--max-stack-limit XXX`
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl

# If you only want to check the ground truths
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl --check-gt-only
```
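For example, to raise the address-space limit to 256 GB (these flags take values in MB; the number is only an illustration):

```bash
# Hypothetical example: evaluate with a 256 GB address-space limit (values are in MB)
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest \
    --subset complete \
    --samples samples-sanitized-calibrated.jsonl \
    --max-as-limit 262144
```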

...Or if you want to try it locally regardless of the risks ⚠️:

First, install the dependencies for BigCodeBench:

```bash
pip install -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

Then, run the evaluation:

```bash
# ...Or locally ⚠️
bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
# ...If you really don't want to check the ground truths
bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl --no-gt
```

> [!Tip]
>
> Do you use a very slow machine?
>
> LLM solutions are regarded as **failed** on timeout (and OOM etc.).
> Specifically, we set the dynamic timeout based on the ground-truth solution's runtime.
>
> Additionally, you are **NOT** encouraged to over-stress your test bed while running the evaluation.
> For example, using `--parallel 64` on a 4-core machine, or doing other heavy work during evaluation, are bad ideas...

<details><summary>⌨️ More command-line flags <i>:: click to expand ::</i></summary>
<div>

* `--parallel`: by default half of the cores

</div>
</details>

The output should look like the following (below is a GPT-4 greedy-decoding example):

```
Asserting the groundtruth...
Expected outputs computed in 1200.0 seconds
Reading samples...
1140it [00:00, 1901.64it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
BigCodeBench-Instruct-calibrated
Groundtruth pass rate: 1.000
pass@1: 0.568
```

- The "k" includes `[1, 5, 10]` where k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.json` will be cached. Remove it to re-run the evaluation
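For reference, the reported pass@k presumably follows the standard unbiased estimator used by HumanEval and EvalPlus, where $n$ is the number of samples per task and $c$ the number that pass all tests:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]
$$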

<details><summary>🤔 How long would it take? <i>:: click to expand ::</i></summary>
<div>

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few minutes on an Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (2 sockets, 18 cores per socket). However, if you have multiple samples per task, the evaluation will take longer.
Here are some tips to speed up the evaluation:

* Use `--parallel $(nproc)`
* Use our pre-evaluated results (see [LLM-generated code](#-LLM-generated-code))

</div>
</details>

## 🔍 Failure Inspection

You can inspect the failed samples by using the following command:

```bash
bigcodebench.inspect --eval-results sample-sanitized-calibrated_eval_results.json --in-place
```

## 🚀 Full Script

We provide a sample script to run the full pipeline:

```bash
bash run.sh
```
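The exact contents of `run.sh` live in the repository; as a rough illustration, such a pipeline might chain the earlier steps like this sketch (model, backend, and file names are placeholders):

```bash
# Hypothetical pipeline sketch: generate -> sanitize (calibrated) -> evaluate
MODEL=meta-llama/Meta-Llama-3-8B-Instruct   # placeholder model name
SUBSET=instruct

bigcodebench.generate --model $MODEL --subset $SUBSET --greedy --bs 1 \
    --temperature 0.0 --n_samples 1 --resume --backend vllm --tp 1

# The generated file follows the naming scheme described above; adjust the name accordingly
SAMPLES=samples.jsonl
bigcodebench.sanitize --samples $SAMPLES --calibrate
bigcodebench.evaluate --subset $SUBSET --samples samples-sanitized-calibrated.jsonl
```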

## 📊 Result Analysis

We provide a script to replicate analyses such as Elo Rating and Task Solve Rate, which help you further understand model performance.

To run the analysis, you need to put all the `samples_eval_results.json` files in a `results` folder, which is in the same directory as the script:

```bash
cd analysis
python get_results.py
```
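For instance, a sketch of collecting the result files into that layout (file locations are placeholders):

```bash
# Hypothetical: copy all evaluation result files into analysis/results, then run the analysis
mkdir -p analysis/results
cp ./*_eval_results.json analysis/results/
cd analysis
python get_results.py
```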

## 💻 LLM-generated Code

We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
*  See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.

## 🐞 Known Issues

- [ ] Due to flakiness in the evaluation, the execution results may vary slightly (~0.2%) between runs. We are working on improving the evaluation stability.

- [ ] You may get errors like `ImportError: /usr/local/lib/python3.10/site-packages/matplotlib/_c_internal_utils.cpython-310-x86_64-linux-gnu.so: failed to map segment from shared object` when running the evaluation. This is caused by the memory limit of the Docker container; increasing the container's memory limit should resolve the issue.

- [ ] We are aware that some users need a proxy to access the internet. We are working on a subset of the tasks that do not require internet access to evaluate the code.

## 📜 Citation

```bibtex
@article{zhuo2024bigcodebench,
    title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions}, 
    author={Terry Yue Zhuo and Minh Chien Vu and Jenny Chim and Han Hu and Wenhao Yu and Ratnadira Widyasari and Imam Nur Bani Yusuf and Haolan Zhan and Junda He and Indraneil Paul and Simon Brunner and Chen Gong and Thong Hoang and Armel Randy Zebaze and Xiaoheng Hong and Wen-Ding Li and Jean Kaddour and Ming Xu and Zhihan Zhang and Prateek Yadav and Naman Jain and Alex Gu and Zhoujun Cheng and Jiawei Liu and Qian Liu and Zijian Wang and David Lo and Binyuan Hui and Niklas Muennighoff and Daniel Fried and Xiaoning Du and Harm de Vries and Leandro Von Werra},
    journal={arXiv preprint arXiv:2406.15877},
    year={2024}
}
```

## 🙏 Acknowledgement

- [EvalPlus](https://github.com/evalplus/evalplus)

            
