| Field | Value |
| --- | --- |
| Name | nutcracker-py |
| Version | 0.0.2a2 |
| Summary | streamline LLM evaluation |
| Author | Bruce W. Lee |
| Upload time | 2024-08-03 10:09:01 |
| Home page | None |
| Maintainer | None |
| Docs URL | None |
| Requires Python | None |
| License | None |
| Keywords | evaluation |
| Requirements | No requirements were recorded. |
# Nutcracker - Large Model Evaluation
Like LM-Eval-Harness, but without the PyTorch madness. Use this to evaluate LLMs served through APIs.
https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce
---
# Installation
### Route 1. PyPI
**Install Nutcracker**
```bash
pip install nutcracker-py
```
**Download Nutcracker DB**
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
### Route 2. GitHub
**Install Nutcracker**
```bash
git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker
```
**Download Nutcracker DB**
```bash
git clone https://github.com/evaluation-tools/nutcracker-db
```
Check all tasks implemented in [Nutcracker DB](https://github.com/evaluation-tools/nutcracker-db)'s readme page.
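The QuickStart snippets below pass `db_directory='nutcracker-db/db'`, so they assume the DB clone sits next to the directory you run Python from. A purely illustrative sanity check:
```python
# illustrative only: verify the layout the demo scripts below assume
import os

assert os.path.isdir('nutcracker-db/db'), "clone nutcracker-db next to your working directory"
```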
---
# QuickStart
### Case Study: Evaluate (Any) LLM API on TruthfulQA ([Script](nutcracker/demos/demo-readme1.py))
##### STEP 1: Define Model
- Define a simple model class with a "*respond(self, user_prompt)*" function.
- We will use OpenAI here, but really, any API can be evaluated as long as a "*respond(self, user_prompt)*" function exists that returns the LLM response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.); an Anthropic-based sketch follows the OpenAI example below.
```python
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        # retry until a response arrives; Ctrl-C still exits cleanly
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                print("Request timed out, retrying...")
        return response_data
```
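As a point of comparison, here is a minimal sketch of the same wrapper around the Anthropic API; the model name and `max_tokens` value are illustrative assumptions, and the only thing Nutcracker needs is a `respond(self, user_prompt)` method that returns a string.
```python
# a minimal sketch, assuming the `anthropic` SDK is installed and
# ANTHROPIC_API_KEY is set; the model name is an illustrative assumption
import anthropic

class Claude:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.model = "claude-3-haiku-20240307"

    def respond(self, user_prompt):
        message = self.client.messages.create(
            model=self.model,
            max_tokens=512,
            messages=[{"role": "user", "content": user_prompt}],
        )
        # Nutcracker only cares that this returns the response as a string
        return message.content[0].text
```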
##### STEP 2: Run Evaluation
```python
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 instances for the demo
truthfulqa.sample(20, in_place=True)

# running this experiment fills each instance's model_response property in the truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation fills each instance's response_correct property in the truthfulqa data object
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
```
---
### Case Study: Task vs. Pile? Evaluating LLaMA on MMLU ([Script](nutcracker/demos/demo-readme2.py))
##### STEP 1: Understand the basis of Nutcracker
- Despite our lengthy history of model evaluation, my understanding is that the field has not reached a clear consensus on what a "benchmark" is (*Is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?*).
- Instead of using the word benchmark, Nutcracker divides its data structures into Instance, Task, and Pile (see blog post: [HERE](https://brucewlee.medium.com/nutcracker-instance-task-pile-38f646c1b36d)).
- Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level; a short loading sketch follows the figure below.
<p align="center">
<img src="resources/w_2100.png" width="400"/>
</p>
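To make the distinction concrete, here is a minimal sketch that loads one Task and one Pile using the two loaders shown elsewhere in this README. The `db_directory` assumes nutcracker-db was cloned next to your working directory; names beyond these two are listed in the Nutcracker DB readme.
```python
from nutcracker.data import Task, Pile

# a Task is a single dataset (e.g., TruthfulQA MC1)
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# a Pile groups several Tasks under one name (e.g., the MMLU subjects)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')

# both behave like collections of Instances
print(len(truthfulqa), len(mmlu))
```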
##### STEP 2: Define Model
- Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: [HERE](https://brucewlee.medium.com/nutcracker-evaluating-on-huggingface-inference-endpoints-6e977e326c5b))
```python
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt}",
        })
        return output[0]['generated_text']
```
##### STEP 3: Load Data
```python
from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
```
##### STEP 4: Run Experiment (Retrieve Model Responses)
- Running the experiment updates each instance's *model_response* attribute within the data object, which is the mmlu Pile in this case.
- You can save the data object at any step of the evaluation. Let's save it this time so we don't have to repeat the API requests if anything goes wrong.
```python
from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)
experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')
```
- You can load and check how the model responded.
```python
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
```
##### STEP 5: Run Evaluation
- LLMs often don’t respond in immediately recognizable letters like A, B, C, or D.
- Therefore, Nutcracker supports an intent-matching feature (requires OpenAI API Key) that parses model response to match discrete labels, but let’s disable that for now and proceed with our evaluation.
- We recommend using intent-matching for almost all use cases. We will publish a detailed research later.
```python
from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
```
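For reference, here is a minimal sketch of the same evaluation with intent matching left enabled. It assumes, per the bullets above, that intent matching is on by default when `disable_intent_matching` is not passed and that an OpenAI API key must be available in the environment; adjust to your setup.
```python
import os
from nutcracker.evaluator import MCQEvaluator, generate_report

# assumption: intent matching is on by default and uses the OpenAI API key below
os.environ["OPENAI_API_KEY"] = "sk-..."

evaluation_im = MCQEvaluator(data=loaded_mmlu)
evaluation_im.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent.txt'))
```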
https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9
---
# Tutorials
- Evaluating on HuggingFace Inference Endpoints -> [HERE / Medium](https://brucewlee.medium.com/nutcracker-evaluating-on-huggingface-inference-endpoints-6e977e326c5b)
- Understanding Instance-Task-Pile -> [HERE / Medium](https://brucewlee.medium.com/nutcracker-instance-task-pile-38f646c1b36d)