# GuessTheRuleBench - GTRBench
Welcome to **GuessTheRuleBench** (PyPI package name **gtrbench**), a dynamic benchmark designed to evaluate the implicit rule deduction capabilities of Large Language Models (LLMs) through "guess-the-rule" games. This repository contains:
- The code for running the benchmark via a Python library.
- A web application demo where human users can play the games or watch LLM agents interact with the system in real time.
- Experiment results and a research paper detailing the methodology and findings.
## High-Level System Design Diagram
Below is a high-level system design diagram that illustrates the various components, their interactions, and the overall workflow of GuessTheRuleBench:
![](/docs/final-proj-system-design.png)
## Research Paper and Demo Presentation
For a complete understanding of the methodology, experiments, and analysis, please refer to the following resources:
[**Research Paper PDF**](/docs/GuessTheRuleBench.pdf)
[**Demo Presentation Video**](https://www.youtube.com/watch?v=zgnqCNjr5H4)
[**Demo Slides**](https://docs.google.com/presentation/d/1_7-yq9PsrscZz8_R5mHI-rbg5dOxJNWxV9zhrfWpCSI/edit?usp=sharing)
## System Requirements
**Python 3.9 or below** is required to run the Python library and backend services. Use a conda environment to avoid installing libraries globally:
```bash
conda create -n guess_the_rule_env python=3.9
conda activate guess_the_rule_env
```
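A script can also check the interpreter version before importing the library. The following is a minimal sketch, not part of gtrbench: `is_supported_python` is a hypothetical helper name, the upper bound of 3.9 comes from the requirement above, and the lower bound of 3.7 is an assumption taken from the package's `requires_python` metadata.

```python
import sys

# Hypothetical helper, not part of gtrbench.
MIN_SUPPORTED = (3, 7)  # assumption: taken from the package's requires_python metadata
MAX_SUPPORTED = (3, 9)  # taken from the requirement stated in this section


def is_supported_python(version=None):
    """Return True if the (major, minor) version lies within the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


if __name__ == "__main__":
    if not is_supported_python():
        print("Warning: this Python version is outside the 3.7-3.9 range")
```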
## For Agentic Use: Running the Benchmark Python Library
The Python library provides four game classes:
- ```StaticGoingOnAPicnic()``` for the Static Picnic game
- ```DynamicGoingOnAPicnic()``` for the Dynamic Picnic game
- ```CodeFunctionsPicnic()``` for the Code Functions Picnic game
- ```MathGuessTheRuleGame()``` for the Math game
Each class exposes the following methods:
- ```create_game_instance()``` to request a new instance of the game.
- ```get_more_examples(N)``` to request N more examples.
- ```validate_guess(guess)``` to present the user's guess for validation.
- ```get_game_summary()``` to retrieve the performance summary of the current game.
- ```load_game(uuid)``` to load a previously generated game instance.
### Test Code for Static Picnic Game
```python
from lib.domain.picnic.static_picnic.base import StaticGoingOnAPicnic
# Get a new object for the static picnic game
static_picnic_obj = StaticGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)
# Create a new game instance
static_picnic_obj.create_game_instance()
# Request more examples
static_picnic_obj.get_more_examples(n=1)
static_picnic_obj.get_more_examples(n=2)
static_picnic_obj.get_more_examples(n=3)
# Validate guess
static_picnic_obj.validate_guess(guess='Items from the category kitchen appliances')
# Get game summary
static_picnic_obj.get_game_summary()
# Load an existing game and check its summary
loaded_game = StaticGoingOnAPicnic.load_game('650499e9-a5da-4129-b426-8d6517bf65e6')
loaded_game.get_game_summary(include_rule=True)
```
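For agentic use, the same methods can be driven from a loop that alternates between guessing and requesting more examples. The sketch below illustrates that control flow only. `StubGame`, `play`, and `propose_guess` are hypothetical names, and the stub's constructor and return values (lists, booleans, a summary dict) are assumptions made so the example runs on its own; they are not the documented behavior of the gtrbench classes.

```python
# Sketch only: StubGame mimics the method names listed above. Its constructor
# and return values are assumptions, not the documented gtrbench behavior.
class StubGame:
    def __init__(self, rule, examples):
        self._rule = rule
        self._examples = list(examples)
        self._shown = 0
        self._turns = 0
        self._won = False

    def create_game_instance(self):
        # Assumed: starting a game reveals two initial examples.
        return self.get_more_examples(n=2)

    def get_more_examples(self, n):
        # Reveal up to n examples that have not been shown yet.
        batch = self._examples[self._shown:self._shown + n]
        self._shown += len(batch)
        return batch

    def validate_guess(self, guess):
        # Assumed: a guess is accepted on a case-insensitive exact match.
        self._turns += 1
        self._won = guess.strip().lower() == self._rule.lower()
        return self._won

    def get_game_summary(self):
        return {"turns": self._turns, "won": self._won, "examples_seen": self._shown}


def play(game, propose_guess, max_turns=3):
    """Guess, and on a miss request one more example, until accepted or out of turns."""
    seen = list(game.create_game_instance())
    for _ in range(max_turns):
        if game.validate_guess(propose_guess(seen)):
            break
        seen.extend(game.get_more_examples(n=1))
    return game.get_game_summary()


game = StubGame("kitchen appliances", ["toaster", "blender", "kettle", "oven"])


def toy_guesser(seen):
    # Toy stand-in for an LLM: only finds the rule after seeing three examples.
    return "kitchen appliances" if len(seen) >= 3 else "fruits"


summary = play(game, toy_guesser)
print(summary)
```

Swapping the stub for a real game class would mean adapting `play` to whatever the library's methods actually return, which should be checked against the library itself.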
## Web Application UI
Below are some screenshots showcasing the web application that we created to demo our benchmark:
1. **Landing Page**
![](/docs/landing-page.png)
2. **Docs Page**
![](/docs/docs.png)
3. **Game Play**
Either start a new game or load an existing game using a previously generated game UUID.
![](/docs/choose-gameplay.png)
Select the game configurations to start a new game.
![](/docs/start-new-game.png)
Game play UI.
![](/docs/play.png)
## Benchmark Experiments High-Level Results
Below is a summary of the average win rate of different models across all games and difficulty levels. Bold values highlight the best performance in each column.
![](/docs/evaluation-results.png)
## Feedback and Contribution
If you have any suggestions, issues, or contributions, please feel free to open an issue or submit a pull request. We appreciate your interest and support in improving GuessTheRuleBench.
Raw data
{
"_id": null,
"home_page": null,
"name": "gtrbench",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "benchmark, llm, implicit reasoning, guess-the-rule games, gtrbench",
"author": "Ali Shazal (with Michael Lu, Xiang Zheng, Juno Lee, Arihant Choudhary)",
"author_email": "ali.shazal@berkeley.edu",
"download_url": "https://files.pythonhosted.org/packages/be/67/7c0be3d13ed77508da667e8308dc8ac055c67f93ec2cfa35391efbe3a7bf/gtrbench-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A benchmark to evaluate implicit reasoning in LLMs using guess-the-rule games",
"version": "0.0.1",
"project_urls": null,
"split_keywords": [
"benchmark",
" llm",
" implicit reasoning",
" guess-the-rule games",
" gtrbench"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "60160f6cb4eb1976b0899f3cac6019a6b1adf6ba5131d6b0e2bee4538d555afd",
"md5": "e9fcbe5fc08216b17d4dc944402bbb35",
"sha256": "37bfb764d307f015625d7b95a9364dbe26efaaca850b92c94a9a3ca7b2d7be21"
},
"downloads": -1,
"filename": "gtrbench-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e9fcbe5fc08216b17d4dc944402bbb35",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 25050,
"upload_time": "2025-01-19T01:58:09",
"upload_time_iso_8601": "2025-01-19T01:58:09.759224Z",
"url": "https://files.pythonhosted.org/packages/60/16/0f6cb4eb1976b0899f3cac6019a6b1adf6ba5131d6b0e2bee4538d555afd/gtrbench-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "be677c0be3d13ed77508da667e8308dc8ac055c67f93ec2cfa35391efbe3a7bf",
"md5": "389ef45a127baa9fb1aae7bb97db0f8d",
"sha256": "e9a15dfb4cc0de43f42a7254d820c3bbe48cc40a71ab0d3a5ebb8271320e42fa"
},
"downloads": -1,
"filename": "gtrbench-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "389ef45a127baa9fb1aae7bb97db0f8d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 21931,
"upload_time": "2025-01-19T01:58:11",
"upload_time_iso_8601": "2025-01-19T01:58:11.730689Z",
"url": "https://files.pythonhosted.org/packages/be/67/7c0be3d13ed77508da667e8308dc8ac055c67f93ec2cfa35391efbe3a7bf/gtrbench-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-19 01:58:11",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "gtrbench"
}