# Eyeball++
_Simple and fun evaluation for LLM tasks_
eyeball_pp is a framework to evaluate tasks that use LLMs.
LLMs make it very easy to build intelligent systems, but those systems are probabilistic. You can't say "my system does X"; rather, you say "my system tries to do X and succeeds Y% of the time."
This framework helps you answer questions like: "How is my system performing as I change it?", "Which LLM is best for my specific task?" or "This prompt change looks good on the example I just tested, but does it work on everything else I've tested?"
Your task can be arbitrary, the framework only cares about inputs and outputs.
If you've been eyeballing how well your changes are working, this framework should fit right in and help you evaluate your task in a more methodical manner WITHOUT you having to become methodical.
# Installation
eyeball_pp is a Python library and can be installed via pip:
`pip install eyeball_pp`
# Example
See this [notebook](examples/qa_task_notebook.ipynb) for an example, or check out more examples in the [examples](examples/) folder.
# Concepts
eyeball_pp has 3 simple concepts -- record, evaluate and rerun.
## Record
eyeball_pp records the inputs and outputs of your task as it runs and saves them as checkpoints. You can record locally while developing, or to a database from a production system. You can optionally record human feedback on the task output too.
You can record your task using the `record_task` decorator. This will record every run of the function as a `Checkpoint` for future comparison. `input_names` specifies which inputs to record, and the function's return value is saved as the output.
```python
import eyeball_pp
@eyeball_pp.record_task(input_names=['input_a', 'input_b'])
def your_task_function(input_a, input_b):
    # Your task can run arbitrary code
    ...
    return task_output
```
If you want to record additional inputs within your function call you can just call `eyeball_pp.record_input('variable_name', value)` inside your function.
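For example, a minimal sketch (the `question` argument and the `retrieved_context` variable are illustrative names, not part of the library):
```python
import eyeball_pp

@eyeball_pp.record_task(input_names=['question'])
def answer_question(question):
    # `retrieved_context` is an intermediate value computed inside the task;
    # recording it makes it part of the checkpoint alongside `question`.
    retrieved_context = "...context fetched from your retrieval step..."
    eyeball_pp.record_input('retrieved_context', retrieved_context)
    return f"Answer based on: {retrieved_context}"
```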
If your system is more complicated, you can also use the `record_input` and `record_output` functions with the `start_recording_session` context manager:
```python
import eyeball_pp
with eyeball_pp.start_recording_session(task_name="your_task", checkpoint_id="some_custom_unique_id"):
    eyeball_pp.record_input('input_a', input_a_value)
    eyeball_pp.record_input('input_b', input_b_value)
    # your task code
    eyeball_pp.record_output(output)
```
Or without the context manager:
```python
eyeball_pp.record_input(task_name="your_task", checkpoint_id="some_custom_unique_id", variable_name="input_a", value=input_a_value)
..
eyeball_pp.record_output(task_name="your_task", checkpoint_id="some_custom_unique_id", variable_name='output', output=output)
```
By default, data is recorded as YAML files in your repo so you can always inspect, change, or delete it. If you want to log data from a production system you can record it in your own database or get an API key from https://eyeball.tark.ai
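For example, to keep the recordings in a specific directory you can point `set_config` at it (the directory name below is arbitrary; `dir_path` also appears in the Configuration section):
```python
import eyeball_pp

# "eval_recordings" is an arbitrary directory name chosen for this example.
eyeball_pp.set_config(dir_path="eval_recordings")
```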
Once you have recorded a few runs, you can then evaluate your system.
## Evaluate
eyeball_pp tells you how your system is performing by looking at how many runs were successful. If your task output can be evaluated objectively, you can supply a custom `output_grader` (see the sketch at the end of this section); otherwise you can use the built-in model-graded eval, which uses a model to judge whether your task output meets the objective you set. If you've been recording human feedback for your task runs, that feedback is used to fine-tune the evaluator model.
```python
# The example below uses the built-in model-graded eval
eyeball_pp.evaluate_system(task_objective="The task should answer questions based on the context provided and also show the sources")
```
Example output of the above command would look something like this:
| Date | Results | Stats |
| --- | --- | --- |
| 09 Aug | 85.7% success (6.0/7) | 7 datapoints, 3 unique inputs |

| Run | Results | Stats | Params | Output Variance (higher value ⇒ unpredictable system) |
| --- | --- | --- | --- | --- |
| Rerun on 09 Aug 05:39 | 66.7% success (2/3) | 3 datapoints, 3 unique inputs | temperature=0.2 | |
| 09 Aug 05:30 - 05:39 | 100.0% success (4/4) | 4 datapoints, 3 unique inputs | | 0.0 |
The comparison will also write `system_health_by_date.md`, `system_health_by_run_history.md`, and `per_input_breakdown.md` files to your repo with the latest system health updates.
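If your output can be checked objectively, a custom grader avoids the model-graded eval entirely. The `output_grader` parameter is named above, but its exact signature and return type are assumptions in this sketch -- check the library source for the real interface:
```python
import eyeball_pp

# Assumed grader shape: a callable that receives the recorded output and
# returns whether the run succeeded. The check below is hypothetical and
# matches the "show the sources" objective used earlier.
def contains_sources_grader(output: str) -> bool:
    return "Source:" in output

eyeball_pp.evaluate_system(output_grader=contains_sources_grader)
```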
## Rerun
You can then re-run these pre-recorded examples as you make changes. Only each unique set of input variables is re-run, e.g. if you recorded a run of `your_task_function(1, 2)` 5 times and `your_task_function(3, 4)` 2 times, the re-run would only run 2 examples -- (1, 2) and (3, 4). Each of these re-runs is saved as a new checkpoint for comparison later.
```python
from eyeball_pp import rerun_recorded_examples
for input_vars in rerun_recorded_examples(input_names=['input_a', 'input_b']):
    your_task_function(input_vars['input_a'], input_vars['input_b'])
```
You can also re-run the examples with a set of parameters you want to evaluate (e.g. model, temperature, etc.) -- these params will show up in the comparison later and help you decide which params give the best performance on your task.
```python
from eyeball_pp import get_eval_param, rerun_recorded_examples
for vars in rerun_recorded_examples(input_names=['input_a', 'input_b'], eval_params_list=[{'model': 'gpt-4', 'temperature': 0.7}, {'model': 'mpt-30b-chat'}]):
    your_task_function(vars['input_a'], vars['input_b'])

# You can access these eval params from anywhere in your code
@eyeball_pp.record_task(input_names=['input_a', 'input_b'])
def your_task_function(input_a, input_b):
    ...
    model = get_eval_param('model') or 'gpt-3.5-turbo'
    temperature = get_eval_param('temperature') or 0.5
```
# Configuration
## Serialization
For the `record_task` decorator you need to ensure that the function's inputs and outputs are JSON serializable. If a variable is a custom class, you can define a `to_json` method on that class, e.g.
```python
@eyeball_pp.record_task(args_to_skip=["input_a"])
def your_task_function(input_a, input_b: SomeComplexType) -> SomeComplexOutput:
    ...
    return task_output

class SomeComplexType:
    def to_json(self) -> str:
        ...

class SomeComplexOutput:
    def to_json(self) -> str:
        ...
```
## Sample rate
You can set the sample rate for recording; by default it's 1.0.
```python
# Set separate config for dev and production
if is_dev_build():
    eyeball_pp.set_config(sample_rate=1.0, dir_path="/path/to/store/recorded/examples")
else:
    # More details on the api_key integration below
    eyeball_pp.set_config(sample_rate=0.1, api_key=<eyeball_pp_apikey>)
```
## Custom example_id
By default a unique example_id is created from the values of the input variables, but that might not always work (for instance, if the inputs are large or not stable identifiers). In that case you can designate one of the function arguments as the example id:
```python
@eyeball_pp.record_task(example_id_arg_name='request_id', args_to_skip=['request_id'])
def your_task_function(request_id, input_a, input_b):
    ...
    return task_output
```
## Refining the evaluator with Human feedback
If you collect human feedback in your production system, you can log it via the library:
```python
import eyeball_pp
from eyeball_pp import ResponseFeedback
eyeball_pp.record_human_feedback(checkpoint_id, response_feedback=ResponseFeedback.POSITIVE, feedback_details="I liked the response as it selected the sources correctly")
```
You can also rate examples yourself:
```python
eyeball_pp.rate_examples()
```
This will let you compare and rate examples yourself via the command line. Example output of this command would look like:
```
Consider Example(input_a=1, input_b=2):
Does the output `4` fulfil the objective of the task? (Y/n)
Does the output `3` fulfil the objective of the task? (Y/n)
Which response is better?
A) `3` is better than `4`
B) `4` is better than `3`
C) Both are equivalent
```
## API key
eyeball_pp works out of the box locally, but if you want to record information from production you can get an API key from https://eyeball.tark.ai
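Once you have a key, pass it to `set_config` (as in the sample-rate example above). Reading it from an environment variable is one way to keep it out of source control; the variable name below is just a suggestion, not something the library requires:
```python
import os
import eyeball_pp

# EYEBALL_PP_API_KEY is an arbitrary environment variable name chosen for
# this example; the library only needs the api_key value passed to set_config.
eyeball_pp.set_config(api_key=os.environ["EYEBALL_PP_API_KEY"])
```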