# FlexEval LLM Evals
[DOI](https://doi.org/10.5281/zenodo.12729993)
[License](https://github.com/DigitalHarborFoundation/FlexEval/blob/main/LICENSE)

FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.
**Documentation:** <https://digitalharborfoundation.github.io/FlexEval>
Additional details about FlexEval can be found [in our paper](https://doi.org/10.5281/zenodo.12729993) at the _Educational Data Mining_ 2024 conference.
## Usage
Basic usage:
```python
import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config
data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]
eval = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))
config = Config(clear_tables=True)
eval_run = EvalRun(
data_sources=data_sources,
database_path="eval_results.db",
eval=eval,
config=config,
)
flexeval.run(eval_run)
```
This example computes [Flesch reading ease](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease) for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called `eval_results.db`.
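The input file holds one conversation per line. A minimal sketch of producing and reading such a file, assuming OpenAI-style `role`/`content` messages (an assumption on our part; see the vignettes for the exact schema your FlexEval version expects):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example records: each JSONL line holds one conversation
# as a list of OpenAI-style messages. This format is an assumption;
# consult the vignettes for the schema FlexEval actually expects.
conversations = [
    [
        {"role": "user", "content": "How do volcanoes form?"},
        {"role": "assistant", "content": "Volcanoes form where magma reaches the surface."},
    ],
    [
        {"role": "user", "content": "Define photosynthesis."},
        {"role": "assistant", "content": "Plants convert light into chemical energy."},
    ],
]

path = Path(tempfile.gettempdir()) / "conversations.jsonl"
with path.open("w") as f:
    for conversation in conversations:
        f.write(json.dumps(conversation) + "\n")

# Reading it back: one JSON document per line.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded))  # 2
```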
See additional usage examples in the [vignettes](/vignettes).
## Installation
FlexEval is on PyPI as [`python-flexeval`](https://pypi.org/p/python-flexeval). See the [Installation](https://digitalharborfoundation.github.io/FlexEval/getting_started.html#Installation) section in the [Getting Started](https://digitalharborfoundation.github.io/FlexEval/getting_started.html) guide.
Using `pip`:
```bash
pip install python-flexeval
```
## Basic functionality
FlexEval is designed to be "batteries included" for many basic use cases. It supports the following out of the box:
- scoring historical conversations, which is useful for monitoring live systems
- scoring LLMs:
  - hosted locally and served from an endpoint using a tool such as [LM Studio](https://lmstudio.ai)
  - accessible over the network via a REST endpoint
  - any OpenAI LLM
- a set of useful rubrics
- a set of useful Python functions
Evaluation results are saved in an SQLite database. See the [Metric Analysis](/vignettes/metric_analysis.ipynb) vignette for a sample analysis demonstrating the structure and utility of the data saved by FlexEval.
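Because results land in SQLite, they can be explored with Python's built-in `sqlite3` module. A minimal sketch of the kind of aggregation you might run; the table and column names below are illustrative stand-ins, not FlexEval's actual schema (see the Metric Analysis vignette for that):

```python
import sqlite3

# Illustrative only: build a tiny stand-in table shaped like per-turn
# metric results, then aggregate it. The schema FlexEval actually writes
# is documented in the Metric Analysis vignette.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE turn_metrics (conversation_id TEXT, turn_index INTEGER, "
    "metric_name TEXT, metric_value REAL)"
)
conn.executemany(
    "INSERT INTO turn_metrics VALUES (?, ?, ?, ?)",
    [
        ("conv-1", 0, "flesch_reading_ease", 71.0),
        ("conv-1", 1, "flesch_reading_ease", 65.0),
        ("conv-2", 0, "flesch_reading_ease", 55.0),
    ],
)

# Average reading ease per conversation.
rows = conn.execute(
    "SELECT conversation_id, AVG(metric_value) FROM turn_metrics "
    "WHERE metric_name = 'flesch_reading_ease' GROUP BY conversation_id "
    "ORDER BY conversation_id"
).fetchall()
print(rows)  # [('conv-1', 68.0), ('conv-2', 55.0)]
```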
Read more in the [Getting Started](https://digitalharborfoundation.github.io/FlexEval/getting_started.html) guide.
## Cite this work
If this work is useful to you, please cite [our EDM 2024 paper](https://educationaldatamining.org/edm2024/proceedings/2024.EDM-posters.107/2024.EDM-posters.107.pdf):
>S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, & John Whitmer. (2024). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. _Proceedings of the 17th International Conference on Educational Data Mining_, 903-908. Atlanta, Georgia, USA, July 2024. <https://doi.org/10.5281/zenodo.12729993>
## Development
Pull requests are welcome. Feel free to contribute:
- New rubrics or functions
- Bug fixes
- New features
See [DEVELOPMENT.md](DEVELOPMENT.md).