| Name | blade-bench |
| Version | 0.1.1 |
| home_page | https://github.com/behavioral-data/BLADE |
| Summary | Dataset and code for 'BLADE: Benchmarking Language Model Agents for Data-Driven Science' (https://arxiv.org/abs/2408.09667) |
| upload_time | 2024-09-09 04:53:23 |
| maintainer | None |
| docs_url | None |
| author | Ken Gu |
| requires_python | <4.0,>=3.10 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<h1 align="center">
<img src="https://github.com/behavioral-data/BLADE/raw/main/assets/logo.png" width="100" alt="logo" />
<br>
BLADE: Benchmarking Language Model Agents for Data-Driven Science
</h1>
Dataset and code for ["BLADE: Benchmarking Language Model Agents for Data-Driven Science"](https://arxiv.org/abs/2408.09667)
We are working on a hold-out test set. Details soon!
## 📝 Introduction
BLADE is a comprehensive benchmark designed to evaluate Language Model (LM) Agents on writing justifiable analyses of real-world scientific research questions from data (e.g., _Are soccer players with a dark skin tone more likely than those with a light skin tone to receive red cards from referees?_ from [Silberzahn et al.](https://journals.sagepub.com/doi/10.1177/2515245917747646)). In particular, BLADE evaluates Agents' ability to iteratively integrate scientific domain knowledge, statistical expertise, and data understanding to make nuanced analytical decisions.
BLADE consists of X dataset and research question pairs with high-quality ground-truth analysis decisions (i.e., choices of conceptual constructs, transformations, and statistical models) made by expert data scientists and researchers who independently conducted the analyses. In addition, BLADE contains Y multiple-choice questions for discerning justifiable analysis decisions.

<p align="left">
<em><b>Overview of BLADE Construction and Evaluation.</b> We gathered research questions and datasets from existing research papers, crowd-sourced analysis studies, and statistics textbooks, as well as analyses from expert annotators (boxes 1-3). Given a research question and dataset, LM agents generate a full analysis containing the relevant conceptual variables, a data transform function, and a statistical modeling function (boxes 1, 4, and 5). BLADE then performs automatic evaluation against the ground truth (box 6).</em>
</p>
## 🚀 Getting Started
To get started with BLADE, follow the steps below:
### 1. Installation
Clone the repository:
```bash
git clone https://github.com/behavioral-data/BLADE.git
cd BLADE
```
Install locally (developed with Python 3.10.14):
```bash
# recommended to do this inside another environment
conda create --name blade python=3.10 -y
conda activate blade
pip install -e .
```
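To confirm the editable install is visible to Python, here is a quick check (a sketch; it assumes the installed distribution name matches the PyPI name `blade-bench`):

```python
from importlib.metadata import version

# Should print the installed version of the package (0.1.1 on PyPI at the time of writing).
print(version("blade-bench"))
```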
### 2. LM Setup
Next, set the API keys for the LM services you plan to use. BLADE not only evaluates Language Models but also needs one to run its evaluation.
```bash
# for openai
export OPENAI_API_KEY=<your key>
# for google gemini
export GEMINI_API_KEY=<your key>
# for anthropic
export ANTHROPIC_API_KEY=<your key>
```
Some default model configurations (e.g., the environment variable for the API key) are specified in [llm_config.yml](blade_bench/conf/llm_config.yml). You can also set your own configurations by creating your own YAML file following the format in `llm_config.yml` and setting the environment variable `LLM_CONFIG_PATH` to that file.
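For example, a sketch of pointing BLADE at a custom config file from Python (the path is a placeholder; whether the variable is read at import time or lazily depends on the library, so setting it before importing is the safe choice):

```python
import os

# Placeholder path to your own YAML config that follows the format of llm_config.yml.
os.environ["LLM_CONFIG_PATH"] = "/path/to/my_llm_config.yml"

from blade_bench.llms import llm  # imported only after LLM_CONFIG_PATH is set
```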
Here's a minimal example to test that the LLM is working.
```python
from blade_bench.llms import llm
gen = llm(provider="anthropic", model="claude-3.5-sonnet")
response = gen.generate([{"role": "user", "content": "Hello world"}])
```
### 3. Running LMs and Agents
We provide a starter script to run a basic one-shot LM or ReAct agent on our benchmark.
```
Usage: run_gen_analyses.py [OPTIONS]

  For a given dataset and research question, generate analyses for the dataset
  using a language model or a basic ReAct agent that interacts with a notebook
  environment.

  Running this generates the following files in output_dir:

  - command.sh: A bash script that contains the command used to run this script
  - config.json: The configuration used to run this experiment
  - run.log: The log file for the multirun experiment
  - llm.log: The log file for LM prompts and responses for the experiment
  - multirun_analyses.json: The analyses generated. **Note**: This file is used in run_get_eval.py to get the evaluation results.
  - llm_analysis_*.py: The code generated for each run (if it was generated properly) for quick reference

Options:
  --run_dataset [fish|boxes|conversation|reading|crofoot|panda_nuts|fertility|hurricane|teachingratings|mortgage|soccer|affairs|amtl|caschools]
                                  Dataset to run  [required]
  -n, --num_runs INTEGER          Number of runs to perform  [default: 10]
  --use_agent                     Whether to use agent or just the base LM
  --no_cache_code_results         [ONLY used when use_agent=True] Whether to
                                  cache code results when running code.
  --no_use_data_desc              Whether to use data description in the
                                  prompts for the LM  [default: True]
  --llm_config_path FILE          Path to the LLM config file, used to specify
                                  the provider, model, and text generation
                                  config such as the temperature.  [default:
                                  ./conf/llm.yaml]
  --llm_provider [openai|azureopenai|groq|mistral|together|gemini|anthropic|huggingface]
                                  Provider for the LLM to override the config
                                  file at llm_config_path
  --llm_model TEXT                Model for the LLM to override the config
                                  file at llm_config_path.
  --llm_eval_config_path FILE     Path to the LLM eval config file, used to
                                  specify the provider, model, and text
                                  generation config such as the temperature.
                                  [default: ./conf/llm_eval.yaml]
  --output_dir DIRECTORY          output directory to store saved analyses
  --help                          Show this message and exit.
```
This writes the results to the folder specified by `output_dir`. After the script finishes, the output folder contains a `multirun_analyses.json` file, which is used for evaluation.
An example is provided in [example/multirun_analyses.json](example/multirun_analyses.json).
### 4. Evaluating Agent Generated Analyses
We provide a starter script to evaluate the outputs of `run_gen_analyses.py`. Run `run_get_eval.py` as follows:
```
Usage: run_get_eval.py [OPTIONS]

  Runs evaluation and saves the results to the output_dir directory. Running
  this saves the following key files:

  - command.sh: A bash script that contains the command used to run this script
  - eval_results.json of the EvalResults class
  - eval_metrics.json of the MetricsAcrossRuns class containing the metrics
  - llm_history.json of the LLM history class containing the prompt history

Options:
  --multirun_load_path FILE      [EITHER multirun_load_path or
                                 submission_load_path is REQUIRED] Path to
                                 load the multirun analyses.
  --submission_load_path FILE    [EITHER multirun_load_path or
                                 submission_load_path is REQUIRED]
  --llm_eval_config_path FILE    Path to the LLM eval config file
  --no_cache_code_results        Whether to not cache code results when
                                 running code for the evaluation
  --output_dir DIRECTORY         output directory to store saved eval results
  --ks TEXT                      List of k values for diversity metrics.
                                 Default is []
  --diversity_n_samples INTEGER  Number of samples to use for diversity
                                 metrics
  --help                         Show this message and exit.
```
Here is an example:
```bash
python run_get_eval.py --multirun_load_path ./example/multirun_analyses.json
```
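After the run, here is a quick way to inspect the saved metrics (a sketch; it only assumes `eval_metrics.json` in the chosen output directory is a plain JSON object, as described above):

```python
import json
from pathlib import Path

output_dir = Path("path/to/your/output_dir")  # wherever --output_dir pointed

# eval_metrics.json holds the MetricsAcrossRuns contents; we only list its top-level keys here.
metrics = json.loads((output_dir / "eval_metrics.json").read_text())
print(sorted(metrics))
```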
## 🔍 Data Exploration Functions
To access a dataset and its research question, we can:
```python
from blade_bench.data import load_dataset_info, list_datasets, DatasetInfo
all_datasets = list_datasets()
dinfo: DatasetInfo = load_dataset_info("soccer", load_df=True)
rq = dinfo.research_question
df = dinfo.df
```
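A quick look at what comes back (a sketch; it assumes `dinfo.df` is a pandas DataFrame, which is what `load_df=True` suggests):

```python
from blade_bench.data import load_dataset_info

dinfo = load_dataset_info("soccer", load_df=True)
print(dinfo.research_question)  # the research question text
print(dinfo.df.shape)           # (rows, columns), assuming a pandas DataFrame
print(dinfo.df.head())          # first few rows
```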
To explore the ground truth annotations, we can:
```python
from blade_bench.data import load_ground_truth, AnnotationDBData
# each dataset's annotations are prepared the first time it is loaded
gnd_truth: AnnotationDBData = load_ground_truth('soccer')
print(len(gnd_truth.transform_specs))
print(len(gnd_truth.cv_specs))
```
More details about the structure of the ground truth are available in the paper.
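As a sketch that builds on the calls above, we can survey the annotations across every dataset (this assumes `list_datasets()` returns names accepted by `load_ground_truth()`; the first load of each dataset may be slow while its annotations are prepared):

```python
from blade_bench.data import list_datasets, load_ground_truth

# Count ground-truth transform and conceptual-variable specs per dataset.
for name in list_datasets():
    gt = load_ground_truth(name)
    print(f"{name}: {len(gt.transform_specs)} transform specs, {len(gt.cv_specs)} conceptual variable specs")
```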
## 🎯 Evaluating a Submission on BLADE
To evaluate your own agent's analysis for a dataset in BLADE, the LM agent must generate a JSON file that conforms to the schema in [example/submission_schema.json](example/submission_schema.json). An example is shown in [example/submission_analyses.json](example/submission_analyses.json). Then, we just need to specify `--submission_load_path` when running `run_get_eval.py`.
```bash
python run_get_eval.py --submission_load_path ./example/submission_analyses.json
```
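Before submitting, it can help to validate the file against the published schema. This sketch uses the third-party `jsonschema` package (not a BLADE dependency) and the repository's example files:

```python
import json

from jsonschema import validate  # pip install jsonschema

with open("example/submission_schema.json") as f:
    schema = json.load(f)
with open("example/submission_analyses.json") as f:
    submission = json.load(f)

# Raises jsonschema.exceptions.ValidationError if the submission does not conform.
validate(instance=submission, schema=schema)
print("Submission conforms to the schema.")
```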
## Citation
If you use our dataset or models in your research, please cite us as follows:
```bibtex
@article{gu2024bladebenchmarkinglanguagemodel,
title={BLADE: Benchmarking Language Model Agents for Data-Driven Science},
author={Ken Gu and Ruoxi Shang and Ruien Jiang and Keying Kuang and Richard-John Lin and Donghe Lyu and Yue Mao and Youran Pan and Teng Wu and Jiaqian Yu and Yikun Zhang and Tianmai M. Zhang and Lanyi Zhu and Mike A. Merrill and Jeffrey Heer and Tim Althoff},
year={2024},
journal = {arXiv},
url={https://arxiv.org/abs/2408.09667},
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/behavioral-data/BLADE",
"name": "blade-bench",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": null,
"author": "Ken Gu",
"author_email": "kenqgu@cs.washington.edu",
"download_url": "https://files.pythonhosted.org/packages/4b/21/b0b46ac505b2dfc8081f69adcd116ba6aca69589a821e531584f8243c8e1/blade_bench-0.1.1.tar.gz",
"platform": null,
"description": "<h1 align=\"center\">\n<img src=\"https://github.com/behavioral-data/BLADE/raw/main/assets/logo.png\" width=\"100\" alt=\"logo\" />\n<br>\nBLADE: Benchmarking Language Model Agents for Data-Driven Science\n</h1>\n\nDataset and code for [\"BLADE: Benchmarking Language Model Agents for Data-Driven Science\"](https://arxiv.org/abs/2408.09667)\n\nWe are working on a hold-out test set. Details soon!\n\n## \ud83d\udcdd Introduction\n\nBLADE is a comprehensive benchmark designed to evaluate Language Model (LM) Agents on writing justifiable analyses on real-world scientific research questions from data (e.g., _Are soccer players with a dark skin tone more likely than those with a light skin tone to receive red cards from referees?_ from [Silberzahn et al.](https://journals.sagepub.com/doi/10.1177/2515245917747646)). In particular, BLADE evaluates Agents' ability to iteratively integrate scientific domain knowledge, statistical expertise, and data understanding to make nuanced analytical decisions\n\nBLADE consists of X dataset and research question pairs with high-quality ground truth analysis decisions (i.e., choice of conceptual construct, transformations, statistical model) made by expert data scientists and researchers who independently conducted the analyses. In addition, BLADE contains Y multiple choice questions for discerning justifiable analysis decisions.\n\n\n\n<p align=\"left\">\n <em><b>Overview of BLADE Construction and Evaluation.</b> We gathered research questions and datasets from existing research papers, crowd-sourced analysis studies and statistic textbooks as well as analyses from expert annotators (boxes 1-3). Given a research question and dataset, LM agents generate a full analysis containing the relevant conceptual variables, a data transform function, and a statistical modeling function (boxes 1, 4, and 5). BLADE then performs automatic evaluation against the ground truth (box 6).</em>\n</p>\n\n## \ud83d\ude80 Getting Started\n\nTo get started with BLADE, follow the steps below:\n\n### 1. Installation\n\nClone the repository:\n\n```bash\ngit clone https://github.com/behavioral-data/BLADE.git\ncd BLADE\n```\n\nInstall locally (developed in python=3.10.14)\n\n```bash\n# recommended to do this inside another environment\nconda create --name blade python=3.10 -y\nconda activate blade\npip install -e .\n```\n\n### 2. LM Setup\n\nNext, set the API keys for different LM services. BLADE both not only evalutes Language Models but needs one for evaluation.\n\n```bash\n# for openai\nexport OPENAI_API_KEY=<your key>\n\n# for google gemini\nexport GEMINI_API_KEY=<your key>\n\n# for anthropic\nexport ANTHROPIC_API_KEY=<your key>\n```\n\nSome default model configurations (e.g., environment variable for the api key) are specified in [llm_config.yml](blade_bench/conf/llm_config.yml). You can also set your own configurations by creating your own yaml file folloing the format in `llm_config.yml` and setting the environment variable `LLM_CONFIG_PATH` to the file.\n\nHere's a minimal example to test that the llm is working.\n\n```python\nfrom blade_bench.llms import llm\ngen = llm(provider=\"anthropic\", model=\"claude-3.5-sonnet\")\nresponse = gen.generate([{\"role\": \"user\", \"content\": \"Hello world\"}])\n```\n\n### 3. 
Running LMs and Agent\n\nWe provide a starter script to run a basic one shot LM or ReACT agent for our benchmark.\n\n```\nUsage: run_gen_analyses.py [OPTIONS]\n\n For a given dataset and research question, generate analyses for the dataset\n using a language model or a basic ReAct agent that interacts with a notebook\n environment.\n\n Running this generates the following files in output_dir:\n\n - command.sh: A bash script that contains the command used to run this script\n - config.json: The configuration used to run this experiment\n - run.log: The log file for the multirun experiment\n - llm.log: The log file for LM prompts and responses for the experiment\n - multirun_analyses.json: The analyses generated. **Note**: This file is used in run_get_eval.py to get the evaluation results.\n - llm_analysis_*.py: The code generated for each run (if it was generated properly) for quick reference\n\nOptions:\n --run_dataset [fish|boxes|conversation|reading|crofoot|panda_nuts|fertility|hurricane|teachingratings|mortgage|soccer|affairs|amtl|caschools]\n Dataset to run [required]\n -n, --num_runs INTEGER Number of runs to perform [default: 10]\n --use_agent Whether to use agent or just the base LM\n --no_cache_code_results [ONLY used when use_agent=True] Whether to\n --no_cache_code_results [ONLY used when use_agent=True] Whether to\n cache code results when running code.\n --no_use_data_desc Whether to use data description in the\n prompts for the LM [default: True]\n --llm_config_path FILE Path to the LLM config file, used to specify\n the provider, model, and text generation\n config such as the temperature. [default:\n ./conf/llm.yaml]\n --llm_provider [openai|azureopenai|groq|mistral|together|gemini|anthropic|huggingface]\n Provider for the LLM to override the config\n file at llm_config_path\n --llm_model TEXT Model for the LLM to override the config\n file at llm_config_path.\n --llm_eval_config_path FILE Path to the LLM eval config file, used to\n specify the provider, model, and text\n generation config such as the temperature.\n [default: ./conf/llm_eval.yaml]\n --output_dir DIRECTORY output directory to store saved analyses\n --help Show this message and exit.\n```\n\nThis will write the results to the folder specified by `output_dir`. After running the script, in the output folder, there will be a `multirun_analyses.json` file which is used for evaluation.\n\nAn example is provided in [example/multirun_analyses.json](example/multirun_analyses.json).\n\n### 4. Evaluating Agent Generated Analyses\n\nWe provide a starter script to evaluate the outputs of `run_gen_analyses.py`. Run `run_get_eval.py` as follows:\n\n```\nUsage: run_get_eval.py [OPTIONS]\n\n Runs evaluation and saves the results to the output_dir directory. 
Running\n this saves the following key files:\n\n - command.sh: A bash script that contains the command used to run this script\n - eval_results.json of the EvalResults class\n - eval_metrics.json of the MetricsAcrossRuns class containing the metrics\n - llm_history.json of the LLM history class containing the prompt history\n\nOptions:\n --multirun_load_path FILE [EITHER multirun_load_path or\n submission_load_path is REQUIRED] Path to\n load the multirun analyses.\n --submission_load_path FILE [EITHER multirun_load_path or\n submission_load_path is REQUIRED]\n --llm_eval_config_path FILE Path to the LLM eval config file\n --no_cache_code_results Whether to not cache code results when\n running code for the evaluation\n --output_dir DIRECTORY output directory to store saved eval results\n --ks TEXT List of k values for diversity metrics.\n Default is []\n --diversity_n_samples INTEGER Number of samples to use for diversity\n metrics\n --help Show this message and exit.\n```\n\nHere is an example:\n\n```bash\npython run_get_eval.py --multirun_load_path ./examples/multirun_analyses.json\n```\n\n## \ud83d\udd0d Data Exploration Functions\n\nTo access the dataset and research question we can:\n\n```python\nfrom blade_bench.data import load_dataset_info, list_datasets, DatasetInfo\n\nall_datasets = list_datasets()\ndinfo: DatasetInfo = load_dataset_info(\"soccer\", load_df=True)\nrq = dinfo.research_question\ndf = dinfo.df\n```\n\nTo explore the ground truth annotations, we can:\n\n```python\nfrom blade_bench.data import load_ground_truth, AnnotationDBData\n\n# each dataset annotations will be prepared when it is run the first time\ngnd_truth: AnnotationDBData = load_ground_truth('soccer')\nprint(len(gnd_truth.transform_specs))\nprint(len(gnd_truth.cv_specs))\n```\n\nMore details about the structure of the ground truth is available in the paper.\n\n## \ud83c\udfaf Evaluating a Submission on BLADE\n\nTo evalute your own agent analysis for a dataset in BLADE, the LM agent must generate a `json` file that conforms to the schema in [example/submission_schema.json](example/submission_schema.json). An example is shown in [example/submission_analyses.json](example/submission_analyses.json). Then, we just need to specify --submission_load_path when running `run_get_eval.py`.\n\n```bash\npython run_get_eval.py --submission_load_path ./example/submission_analyses.json\n```\n\n## Citation\n\nIf you use our dataset or models in your research, please cite us as follows:\n\n```bibtex\n@article{gu2024bladebenchmarkinglanguagemodel,\n title={BLADE: Benchmarking Language Model Agents for Data-Driven Science},\n author={Ken Gu and Ruoxi Shang and Ruien Jiang and Keying Kuang and Richard-John Lin and Donghe Lyu and Yue Mao and Youran Pan and Teng Wu and Jiaqian Yu and Yikun Zhang and Tianmai M. Zhang and Lanyi Zhu and Mike A. Merrill and Jeffrey Heer and Tim Althoff},\n year={2024},\n journal = {arXiv},\n url={https://arxiv.org/abs/2408.09667},\n}\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Dataset and code for 'BLADE: Benchmarking Language Model Agents for Data-Driven Science'(https://arxiv.org/abs/2408.09667)",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/behavioral-data/BLADE",
"Repository": "https://github.com/behavioral-data/BLADE"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f6ca9b3c440521cb49b5f69358f9a9bef90caf50eba5ff3fbd0cf27e2d712008",
"md5": "4b8aa6630681b93dff8678f34c03b61d",
"sha256": "7bc55f799e6cb406213c02e202355a454df5fc83b2445f1eaf6119ce8a227ccc"
},
"downloads": -1,
"filename": "blade_bench-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4b8aa6630681b93dff8678f34c03b61d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 14260624,
"upload_time": "2024-09-09T04:53:15",
"upload_time_iso_8601": "2024-09-09T04:53:15.387490Z",
"url": "https://files.pythonhosted.org/packages/f6/ca/9b3c440521cb49b5f69358f9a9bef90caf50eba5ff3fbd0cf27e2d712008/blade_bench-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4b21b0b46ac505b2dfc8081f69adcd116ba6aca69589a821e531584f8243c8e1",
"md5": "7d48d6602bce60c084a43e7a5d45ce64",
"sha256": "ce96bc2118d45788569a094361e7189f5af9eb4eaf5886979ba5f12b34a09a9f"
},
"downloads": -1,
"filename": "blade_bench-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "7d48d6602bce60c084a43e7a5d45ce64",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 14071062,
"upload_time": "2024-09-09T04:53:23",
"upload_time_iso_8601": "2024-09-09T04:53:23.440986Z",
"url": "https://files.pythonhosted.org/packages/4b/21/b0b46ac505b2dfc8081f69adcd116ba6aca69589a821e531584f8243c8e1/blade_bench-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-09 04:53:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "behavioral-data",
"github_project": "BLADE",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "blade-bench"
}