| Field | Value |
| --- | --- |
| Name | nutcracker-py |
| Version | 0.0.2a2 |
| Summary | streamline LLM evaluation |
| Author | Bruce W. Lee |
| Upload time | 2024-08-03 10:09:01 |
| Home page | None |
| Maintainer | None |
| Docs URL | None |
| Requires Python | None |
| License | None |
| Keywords | evaluation |
| Requirements | No requirements were recorded. |
# Nutcracker - Large Model Evaluation
Like LM-Eval-Harness, but without the PyTorch madness. Use this to evaluate LLMs served through APIs.
https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce
---
# Installation
### Route 1. PyPI
**Install Nutcracker**
```bash
pip install nutcracker-py
```
**Download Nutcracker DB**
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
### Route 2. GitHub
**Install Nutcracker**
```bash
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
```
**Download Nutcracker DB**
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
Check all tasks implemented in [Nutcracker DB](https://github.com/evaluation-tools/nutcracker-db)'s readme page.
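The QuickStart snippets below pass `db_directory='nutcracker-db/db'`, so they assume the DB clone sits next to the directory you run Python from. A purely illustrative sanity check:
```python
# illustrative only: verify the layout the demo scripts below assume
import os

assert os.path.isdir('nutcracker-db/db'), "clone nutcracker-db next to your working directory"
```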
---
# QuickStart
### Case Study: Evaluate (Any) LLM API on TruthfulQA ([Script](nutcracker/demos/demo-readme1.py))
##### STEP 1: Define Model
- Define a simple model class with a "*respond(self, user_prompt)*" function.
- We will use OpenAI here, but really, any API can be evaluated as long as a "*respond(self, user_prompt)*" function exists that returns the LLM response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.); an Anthropic-based sketch follows the OpenAI example below.
```python
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        # retry until a response arrives; Ctrl-C still exits cleanly
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                print("Request timed out, retrying...")
        return response_data
```
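As a point of comparison, here is a minimal sketch of the same wrapper around the Anthropic API; the model name and `max_tokens` value are illustrative assumptions, and the only thing Nutcracker needs is a `respond(self, user_prompt)` method that returns a string.
```python
# a minimal sketch, assuming the `anthropic` SDK is installed and
# ANTHROPIC_API_KEY is set; the model name is an illustrative assumption
import anthropic

class Claude:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.model = "claude-3-haiku-20240307"

    def respond(self, user_prompt):
        message = self.client.messages.create(
            model=self.model,
            max_tokens=512,
            messages=[{"role": "user", "content": user_prompt}],
        )
        # Nutcracker only cares that this returns the response as a string
        return message.content[0].text
```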
##### STEP 2: Run Evaluation
```python
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 instances for the demo
truthfulqa.sample(20, in_place=True)

# running this experiment fills each instance's model_response property in the truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation fills each instance's response_correct property in the truthfulqa data object
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
```
---
### Case Study: Task vs. Pile? Evaluating LLaMA on MMLU ([Script](nutcracker/demos/demo-readme2.py))
##### STEP 1: Understand the basis of Nutcracker
- Despite our lengthy history of model evaluation, my understanding is that the field has not reached a clear consensus on what a "benchmark" is (*Is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?*).
- Instead of using the word benchmark, Nutcracker divides its data structures into Instance, Task, and Pile (see blog post: [HERE](https://brucewlee.medium.com/nutcracker-instance-task-pile-38f646c1b36d)).
- Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level; a short loading sketch follows the figure below.
<p align="center">
<img src="resources/w_2100.png" width="400"/>
</p>
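To make the distinction concrete, here is a minimal sketch that loads one Task and one Pile using the two loaders shown elsewhere in this README. The `db_directory` assumes nutcracker-db was cloned next to your working directory; names beyond these two are listed in the Nutcracker DB readme.
```python
from nutcracker.data import Task, Pile

# a Task is a single dataset (e.g., TruthfulQA MC1)
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# a Pile groups several Tasks under one name (e.g., the MMLU subjects)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')

# both behave like collections of Instances
print(len(truthfulqa), len(mmlu))
```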
##### STEP 2: Define Model
- Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: [HERE](https://brucewlee.medium.com/nutcracker-evaluating-on-huggingface-inference-endpoints-6e977e326c5b))
```python
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']
```
##### STEP 3: Load Data
```python
from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```
##### STEP 4: Run Experiment (Retrieve Model Responses)
- Running the experiment updates each instance's *model_response* attribute within the data object, which is the mmlu Pile in this case.
- You can save the data object at any step of the evaluation. Let's save it this time so we don't have to repeat the API requests if anything goes wrong.
```python
from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)
experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')
```
- You can load and check how the model responded.
```python
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
```
##### STEP 5: Run Evaluation
- LLMs often don’t respond in immediately recognizable letters like A, B, C, or D.
- Therefore, Nutcracker supports an intent-matching feature (requires OpenAI API Key) that parses model response to match discrete labels, but let’s disable that for now and proceed with our evaluation.
- We recommend using intent-matching for almost all use cases. We will publish a detailed research later.
```python
from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```
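For reference, here is a minimal sketch of the same evaluation with intent matching left enabled. It assumes, per the bullets above, that intent matching is on by default when `disable_intent_matching` is not passed and that an OpenAI API key must be available in the environment; adjust to your setup.
```python
import os
from nutcracker.evaluator import MCQEvaluator, generate_report

# assumption: intent matching is on by default and uses the OpenAI API key below
os.environ["OPENAI_API_KEY"] = "sk-..."

evaluation_im = MCQEvaluator(data=loaded_mmlu)
evaluation_im.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent.txt'))
```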
https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9
---
# Tutorials
- Evaluating on HuggingFace Inference Endpoints -> [HERE / Medium](https://brucewlee.medium.com/nutcracker-evaluating-on-huggingface-inference-endpoints-6e977e326c5b)
- Understanding Instance-Task-Pile -> [HERE / Medium](https://brucewlee.medium.com/nutcracker-instance-task-pile-38f646c1b36d)