thumb

Name: thumb
Version: 0.2.9
Summary: A simple prompt testing library for LLMs.
Home page: https://github.com/hammer-mt/thumb
Author: Mike Taylor
License: MIT
Upload time: 2023-12-12 14:35:04

# thumb

A simple prompt testing library for LLMs.

## Quick start

### 1. Install the library

> `pip install thumb`

### 2. Set up a test

```Python
import os
import thumb

# Set your API key: https://platform.openai.com/account/api-keys
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"

# set up prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb.test([prompt_a, prompt_b])
```

### 3. Rate the responses

Each prompt is run 10 times asynchronously by default, which is around 9x faster than running them sequentially. In Jupyter Notebooks, a simple user interface is displayed for blind rating the responses (you don't see which prompt generated each response).

![image](/img/thumb.png)
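
The 10-run default and asynchronous generation can both be overridden via the optional parameters documented under Parameters below. A minimal sketch for a quicker, cheaper smoke test, reusing the prompts from step 2:

```Python
# fewer responses per prompt for a quick smoke test; async_generate=False
# forces the calls to run sequentially instead of concurrently
test = thumb.test([prompt_a, prompt_b], runs=3, async_generate=False)
```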

Once all responses have been rated, the following performance statistics are calculated, broken down by prompt template:
- `avg_score`: the amount of positive feedback as a percentage of all runs
- `avg_tokens`: how many tokens were used across the prompt and response
- `avg_cost`: an estimate of how much the prompt cost to run on average

A simple report is displayed in the notebook, and the full data is saved to a CSV file `thumb/ThumbTest-{TestID}.csv`.

![image](/img/eval.png)
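
If you want to dig into the raw results outside the notebook, the exported CSV can be opened with any dataframe library. The exact columns depend on your version of `thumb`, so treat this as a sketch with a hypothetical test ID:

```Python
import pandas as pd

# the TestID is shown in the notebook report; "abcd1234" is just a placeholder
df = pd.read_csv("thumb/ThumbTest-abcd1234.csv")
print(df.head())  # inspect whichever columns your version of thumb exports
```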

## Functionality

### Test cases

Test cases let you test a prompt template with different input variables. For example, if you want to test a prompt template that includes a variable for a comedian's name, you can set up test cases for different comedians.

```Python
# set up prompt templates for the a/b test
prompt_a = "tell me a joke in the style of {comedian}"
prompt_b = "tell me a family friendly joke in the style of {comedian}"

# set test cases with different input variables
cases = [
  {"comedian": "chris rock"}, 
  {"comedian": "ricky gervais"}, 
  {"comedian": "robin williams"}
  ]

# generate the responses
test = thumb.test([prompt_a, prompt_b], cases)
```

Every test case will be run against every prompt template, so in this example you'll get 6 combinations (3 test cases x 2 prompt templates), which will each run 10 times (60 total calls to OpenAI). Every test case must include a value for each variable in the prompt template.

Prompts may have multiple variables in each test case. For example, if you want to test a prompt template that includes a variable for a comedian's name and a joke topic, you can set up test cases for different comedians and topics.

```Python
# set up prompt templates for the a/b test
prompt_a = "tell me a joke about {subject} in the style of {comedian}"
prompt_b = "tell me a family friendly joke about {subject} in the style of {comedian}"

# set test cases with different input variables
cases = [
  {"subject": "joe biden", "comedian": "chris rock"}, 
  {"subject": "joe biden", "comedian": "ricky gervais"}, 
  {"subject": "donald trump", "comedian": "chris rock"}, 
  {"subject": "donald trump", "comedian": "ricky gervais"}, 
  ]

# generate the responses
test = thumb.test([prompt_a, prompt_b], cases)
```

Every case is tested against every prompt, in order to get a fair comparison of the performance of each prompt given the same input data. With 4 test cases and 2 prompts, you'll get 8 combinations (4 test cases x 2 prompt templates), which will each run 10 times (80 total calls to OpenAI).

### Model testing

```Python
# set up prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb.test([prompt_a, prompt_b], models=["gpt-4", "gpt-3.5-turbo"])
```

This will run each prompt against each model, in order to get a fair comparison of the performance of each prompt given the same input data. With 2 prompts and 2 models, you'll get 4 combinations (2 prompts x 2 models), which will each run 10 times (40 total calls to OpenAI).
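
Since every prompt, case, and model combination is generated `runs` times, the total number of API calls grows multiplicatively. A plain-Python back-of-the-envelope check, assuming (as the examples above suggest) that all dimensions are fully crossed:

```Python
def total_calls(n_prompts, n_cases, n_models=1, runs=10):
    # a test with no cases still runs each prompt, so treat it as one empty case
    return n_prompts * max(n_cases, 1) * n_models * runs

print(total_calls(2, 0, n_models=2))  # 2 prompts x 2 models, no cases -> 40
print(total_calls(2, 4))              # 2 prompts x 4 cases, 1 model   -> 80
```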

### System messages

```Python
# set up prompt templates for the a/b test
system_message = "You are the comedian {comedian}"

prompt_a = [system_message, "tell me a funny joke about {subject}"]
prompt_b = [system_message, "tell me a hilarious joke about {subject}"]

cases = [{"subject": "joe biden", "comedian": "chris rock"}, 
         {"subject": "donald trump", "comedian": "chris rock"}]

# generate the responses
test = thumb.test([prompt_a, prompt_b], cases)
```

Prompts can be a string or an array of strings. If the prompt is an array, the first string is used as a system message, and the rest of the prompts alternate between Human and Assistant messages (`[system, human, ai, human, ai, ...]`). This is useful for testing prompts that include a system message, or that are using pre-warming (inserting prior messages into the chat to guide the AI towards desired behavior).

```Python
# set up prompt templates for the a/b test
system_message = "You are the comedian {comedian}"

prompt_a = [system_message, # system
            "tell me a funny joke about {subject}", # human
            "Sorry, as an AI language model, I am not capable of humor", # assistant
            "That's fine just try your best"] # human
prompt_b = [system_message, # system
            "tell me a hillarious joke about {subject}", # human
            "Sorry, as an AI language model, I am not capable of humor", # assistant
            "That's fine just try your best"] # human

cases = [{"subject": "joe biden", "comedian": "chris rock"}, 
         {"subject": "donald trump", "comedian": "chris rock"}]

# generate the responses
test = thumb.test([prompt_a, prompt_b], cases)
```

### Evaluation report

When the test completes, you get a full evaluation report, broken down by PID (prompt ID), CID (case ID), and model, as well as an overall report broken down by all combinations. If you only test one model or one case, these breakdowns are dropped. The report shows a key at the bottom so you can see which ID corresponds to which prompt or case.

![image](/img/report.png)

### Parameters

The `thumb.test` function takes the following parameters:

#### Required

- **prompts**: an array of prompts (strings) to be tested

#### Optional

- **cases**: a list of dictionaries of variables to input into each prompt template (default: `None`)
- **runs**: the number of responses to generate per prompt and test case (default: `10`)
- **models**: a list of OpenAI models you want to generate responses from (default: [`gpt-3.5-turbo`])
- **async_generate**: a boolean that denotes whether to run async or sequentially (default: `True`)

If you have 10 test runs with 2 prompt templates and 3 test cases, that's `10 x 2 x 3 = 60` calls to OpenAI. Be careful: particularly with GPT-4 the costs can add up quickly!

LangChain tracing to [LangSmith](https://smith.langchain.com/) is automatically enabled if the `LANGCHAIN_API_KEY` environment variable is set (optional).
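
Putting the optional parameters together, a call might look like the sketch below. It assumes the documented parameter names are accepted as keyword arguments; the model list, run count, and API keys are placeholders:

```Python
import os
import thumb

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
# optional: set this to send traces to LangSmith
# os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGCHAIN_API_KEY_HERE"

prompts = ["tell me a joke about {subject}",
           "tell me a family friendly joke about {subject}"]
cases = [{"subject": "joe biden"}, {"subject": "donald trump"}]

test = thumb.test(
    prompts,
    cases,                              # list of test case dictionaries
    runs=5,                             # 5 responses per prompt/case/model combo
    models=["gpt-4", "gpt-3.5-turbo"],  # compare two models
    async_generate=True,                # generate responses concurrently
)
```

That configuration would be 2 prompts x 2 cases x 2 models x 5 runs = 40 calls to OpenAI.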

### Loading and adding

The `thumb.test()` function returns a `ThumbTest` object. You can add more prompts or cases to the test, or run it additional times. You can also generate, evaluate, and export the test data at any time.

```Python
# set up prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb.test([prompt_a, prompt_b])

# add more prompts
test.add_prompts(["tell me a knock knock joke", "tell me a knock knock joke about {subject}"])

# add more cases
test.add_cases([{"subject": "joe biden"}, {"subject": "donald trump"}])

# run each prompt and case 5 more times
test.add_runs(5)

# generate the responses
test.generate()

# rate the responses
test.evaluate()

# export the test data for analysis
test.export_to_csv()
```

Every prompt template gets the same input data from every test case, but the prompt does not need to use all of the variables in the test case. As in the example above, the `tell me a knock knock joke` prompt does not use the `subject` variable, but it is still generated once (with no variables) for each test case.

Test data is cached in a local JSON file `thumb/.cache/{TestID}.json` after every set of runs is generated for a prompt and case combination.
If your test is interrupted, or you want to add to it, you can use the `thumb.load` function to load the test data from the cache.

```Python
# load a previous test
test_id = "abcd1234" # replace with your test id
test = thumb.load(f"thumb/.cache/{test_id}.json")

# run each prompt and case 2 more times
test.add_runs(2)

# generate the responses
test.generate()

# rate the responses
test.evaluate()

# export the test data for analysis
test.export_to_csv()
```
Every run for each combination of prompt and case is stored in the object (and cache), so calling `test.generate()` again will not generate any new responses unless more prompts, cases, or runs are added. Similarly, calling `test.evaluate()` again will not re-rate the responses you have already rated, and will simply redisplay the results if the test has already finished.

## Thumb Testing 👍🧪

The difference between people just playing around with ChatGPT and those [using AI in production](https://huyenchip.com/2023/04/11/llm-engineering.html) is evaluation. LLMs respond non-deterministically, and so it's important to test what results look like when scaled up across a wide range of scenarios. Without an evaluation framework you're left blindly guessing about what's working in your prompts (or not).

Serious [prompt engineers](https://www.saxifrage.xyz/post/prompt-engineering) are testing and learning which inputs lead to useful or desired outputs, reliably and at scale. This process is called [prompt optimization](https://www.saxifrage.xyz/post/prompt-optimization), and it looks like this:

1. Metrics – Establish how you'll measure the performance of the responses from the AI.
2. Hypothesis – Design one or more prompts that may work, based on the latest research.
3. Testing – Generate responses for your different prompts against multiple test cases.
4. Analysis – Evaluate the performance of your prompts and use them to inform the next test.

Thumb testing fills the gap between large-scale professional evaluation mechanisms and [blindly prompting](https://mitchellh.com/writing/prompt-engineering-vs-blind-prompting) through trial and error. If you are transitioning a prompt into a production environment, using `thumb` to test your prompt can help you catch edge cases and get early user or team feedback on the results.

## Contributors

These people are building `thumb` for fun in their spare time. 👍

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
<!-- markdownlint-disable -->
<table>
  <tr>
    <td align="center"><a href="https://twitter.com/hammer_mt"><img src="https://avatars.githubusercontent.com/u/5264596?s=96&v=4" width="100px;" alt=""/><br /><sub><b>hammer-mt</b></sub></a><br /><a href="https://github.com/hammer-mt/thumb/commits?author=hammer-mt" title="Code">๐Ÿ’ป</a></td>
    
  </tr>
</table>

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->

<!-- ALL-CONTRIBUTORS-LIST:END -->

            
