# python-flexeval

- **Version:** 0.1.5
- **Summary:** FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.
- **Author:** S. Thomas Christie, Zachary Levonian, Baptiste Moreau-Pernet, Anna Rafferty, Terry Yu Tian
- **Requires Python:** >=3.10
- **Keywords:** conversation, education, evaluation, large language models, learning engineering
- **Uploaded:** 2025-08-01 01:20:35

# FlexEval LLM Evals

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.12729993.svg)](https://doi.org/10.5281/zenodo.12729993)
[![License](https://img.shields.io/github/license/DigitalHarborFoundation/FlexEval)](https://github.com/DigitalHarborFoundation/FlexEval/blob/main/LICENSE)

![FlexEval banner](/docs/_static/flexeval_banner.svg)

FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.

**Documentation:** <https://digitalharborfoundation.github.io/FlexEval>

Additional details about FlexEval can be found [in our paper](https://doi.org/10.5281/zenodo.12729993) at the _Educational Data Mining_ 2024 conference.

## Usage

Basic usage:

```python
import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config

# Read conversations from a JSONL file
data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]
# Compute a single function metric: Flesch reading ease
eval = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))
# Clear any existing result tables before this run
config = Config(clear_tables=True)
eval_run = EvalRun(
    data_sources=data_sources,
    database_path="eval_results.db",  # SQLite file that will hold the metric values
    eval=eval,
    config=config,
)
flexeval.run(eval_run)
```

This example computes [Flesch reading ease](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease) for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called `eval_results.db`.
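
For orientation, the sketch below shows what one line of such a JSONL file might look like, assuming an OpenAI-style message list; the field names are illustrative assumptions, and the vignettes document the actual input schema:

```python
import json

# Hypothetical conversation record: one JSON object per line, holding
# an OpenAI-style list of chat messages. Field names are illustrative;
# see the vignettes for the actual input schema.
conversation = {
    "messages": [
        {"role": "user", "content": "How do plants make food?"},
        {"role": "assistant", "content": "Through photosynthesis: they turn sunlight, water, and CO2 into sugar."},
    ]
}

with open("conversations.jsonl", "a") as f:
    f.write(json.dumps(conversation) + "\n")
```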

See additional usage examples in the [vignettes](/vignettes).

## Installation

FlexEval is on PyPI as [`python-flexeval`](https://pypi.org/p/python-flexeval). See the [Installation](https://digitalharborfoundation.github.io/FlexEval/getting_started.html#Installation) section in the [Getting Started](https://digitalharborfoundation.github.io/FlexEval/getting_started.html) guide.

Using `pip`:

```bash
pip install python-flexeval
```

## Basic functionality

FlexEval is designed to be "batteries included" for many basic use cases. Out of the box, it supports:

- scoring historical conversations, which is useful for monitoring live systems
- scoring LLMs, including:
  - models hosted locally and served via an endpoint, e.g. with [LM Studio](https://lmstudio.ai) (see the sketch after this list)
  - models exposed via a REST endpoint and reachable over the network
  - any OpenAI LLM
- a set of useful rubrics
- a set of useful Python functions
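
For the locally hosted case, a server such as LM Studio exposes an OpenAI-compatible endpoint. As a rough sketch of reaching such an endpoint from Python (the port and model name are assumptions about a local setup; FlexEval's own completion-function configuration is covered in the documentation):

```python
from openai import OpenAI

# Hypothetical local setup: LM Studio serves an OpenAI-compatible API,
# by default at http://localhost:1234/v1; the API key is unused locally.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # whichever model the local server has loaded
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```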

Evaluation results are saved in an SQLite database. See the [Metric Analysis](/vignettes/metric_analysis.ipynb) vignette for a sample analysis demonstrating the structure and utility of the data saved by FlexEval.
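
As a minimal sketch of inspecting those results with the standard library (the commented-out query uses hypothetical table and column names; the Metric Analysis vignette shows the actual schema):

```python
import sqlite3

conn = sqlite3.connect("eval_results.db")

# List the tables FlexEval created; this works regardless of schema.
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)

# Hypothetical aggregation, to be adapted to the actual table/column names:
# for metric, avg in conn.execute(
#     "SELECT metric_name, AVG(value) FROM metrics GROUP BY metric_name"
# ):
#     print(metric, avg)

conn.close()
```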


Read more in the [Getting Started](https://digitalharborfoundation.github.io/FlexEval/getting_started.html) guide.

## Cite this work

If this work is useful to you, please cite [our EDM 2024 paper](https://educationaldatamining.org/edm2024/proceedings/2024.EDM-posters.107/2024.EDM-posters.107.pdf):

> S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, & John Whitmer. (2024). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. _Proceedings of the 17th International Conference on Educational Data Mining_, 903-908. Atlanta, Georgia, USA, July 2024. <https://doi.org/10.5281/zenodo.12729993>

## Development

Pull requests are welcome. Feel free to contribute:
- New rubrics or functions
- Bug fixes
- New features

See [DEVELOPMENT.md](DEVELOPMENT.md).

            
