# gtrbench

- **Name:** gtrbench
- **Version:** 0.0.1
- **Summary:** A benchmark to evaluate implicit reasoning in LLMs using guess-the-rule games
- **Author:** Ali Shazal (with Michael Lu, Xiang Zheng, Juno Lee, Arihant Choudhary)
- **Author email:** ali.shazal@berkeley.edu
- **License:** MIT
- **Requires Python:** >=3.7
- **Keywords:** benchmark, llm, implicit reasoning, guess-the-rule games, gtrbench
- **Upload time:** 2025-01-19 01:58:11
- **Requirements:** No requirements were recorded.
# GuessTheRuleBench - GTRBench

Welcome to **GuessTheRuleBench** (PyPI package name **gtrbench**), a dynamic benchmark designed to evaluate the implicit rule deduction capabilities of Large Language Models (LLMs) through "guess-the-rule" games. This repository contains:

- The code for running the benchmark via a Python library.
- A web application demo where human users can play the games or watch LLM agents interact with the system in real time.
- Experiment results and a research paper detailing the methodology and findings.

## High-Level System Design Diagram

Below is a high-level system design diagram that illustrates the various components, their interactions, and the overall workflow of GuessTheRuleBench:

![](/docs/final-proj-system-design.png)

## Research Paper and Demo Presentation

For a complete understanding of the methodology, experiments, and analysis, please refer to the resources below:

[**Research Paper PDF**](/docs/GuessTheRuleBench.pdf)

[**Demo Presentation Video**](https://www.youtube.com/watch?v=zgnqCNjr5H4)

[**Demo Slides**](https://docs.google.com/presentation/d/1_7-yq9PsrscZz8_R5mHI-rbg5dOxJNWxV9zhrfWpCSI/edit?usp=sharing)


## System Requirements

**Python 3.9 or earlier** is required to run the Python library and backend services. Use a conda environment to avoid installing libraries globally:
```bash
conda create -n guess_the_rule_env python=3.9
conda activate guess_the_rule_env
```
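With the environment active, the library can be installed from PyPI under its package name, `gtrbench` (the standard pip invocation; the package records no additional requirements):
```bash
pip install gtrbench
```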

## For Agentic Use: Running the Benchmark Python Library
The Python library provides four game classes:

- `StaticGoingOnAPicnic()` for the Static Picnic game
- `DynamicGoingOnAPicnic()` for the Dynamic Picnic game
- `CodeFunctionsPicnic()` for the Code Functions Picnic game
- `MathGuessTheRuleGame()` for the Math game

Each class exposes the following methods:

- `create_game_instance()` to request a new instance of the game.
- `get_more_examples(n)` to request `n` more examples.
- `validate_guess(guess)` to present the user's guess for validation.
- `get_game_summary()` to retrieve the performance summary of the current game.
- `load_game(uuid)` to load a previously generated game instance.

### Test Code for the Static Picnic Game

```python
from lib.domain.picnic.static_picnic.base import StaticGoingOnAPicnic

# Get a new object for the static picnic game
static_picnic_obj = StaticGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)

# Create a new game instance
static_picnic_obj.create_game_instance()

# Request more examples
static_picnic_obj.get_more_examples(n=1)
static_picnic_obj.get_more_examples(n=2)
static_picnic_obj.get_more_examples(n=3)

# Validate guess
static_picnic_obj.validate_guess(guess='Items from the category kitchen appliances')

# Get game summary
static_picnic_obj.get_game_summary()

# Load an existing game and check its summary
loaded_game = StaticGoingOnAPicnic.load_game('650499e9-a5da-4129-b426-8d6517bf65e6')
loaded_game.get_game_summary(include_rule=True)
```
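### Example for the Dynamic Picnic Game

The other three game classes expose the same five methods. Below is a minimal sketch driving the Dynamic Picnic game the same way; the import path and constructor arguments are assumptions that mirror the static game above, so adjust them to the actual package layout:

```python
# Hypothetical import path, assumed to mirror the static picnic module above.
from lib.domain.picnic.dynamic_picnic.base import DynamicGoingOnAPicnic

# Constructor arguments assumed to match StaticGoingOnAPicnic's.
dynamic_picnic_obj = DynamicGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)

# Same lifecycle as the static game: create, request examples, guess, summarize.
dynamic_picnic_obj.create_game_instance()
dynamic_picnic_obj.get_more_examples(n=2)
dynamic_picnic_obj.validate_guess(guess='Items that are red')
dynamic_picnic_obj.get_game_summary()
```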

## Web Application UI
Below are some screenshots showcasing the web application that we created to demo our benchmark:
1. **Landing Page**
![](/docs/landing-page.png)

2. **Docs Page**
![](/docs/docs.png)

3. **Game Play**

Either start a new game or load an existing game using a previously generated game UUID.
![](/docs/choose-gameplay.png)

Select the game configurations to start a new game.
![](/docs/start-new-game.png)

Game play UI.
![](/docs/play.png)

## Benchmark Experiments High-Level Results
Below is a summary of the average win rate of different models across all games and difficulty levels. Bold values highlight the best performance in each column.
![](/docs/evaluation-results.png)

## Feedback and Contribution
If you have any suggestions, issues, or contributions, please feel free to open an issue or submit a pull request. We appreciate your interest and support in improving GuessTheRuleBench.


            
