> [!NOTE]
> This project is under development. The API may undergo major changes between versions, so we recommend checking the [CHANGELOG](https://github.com/nhsengland/evalsense/blob/main/CHANGELOG.md) for any breaking changes before upgrading.
# EvalSense: LLM Evaluation
## About
EvalSense is a framework for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:
- **Broad model support.** Out-of-the-box compatibility with a wide range of local and API-based model providers, including [Ollama](https://github.com/ollama/ollama), [Hugging Face](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), [OpenAI](https://platform.openai.com/docs/api-reference/introduction), [Anthropic](https://docs.claude.com/en/home) and [others](https://inspect.aisi.org.uk/providers.html).
- **Evaluation guidance.** An [interactive evaluation guide](https://nhsengland.github.io/evalsense/guide) and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use case, including the use of perturbed data to assess method effectiveness.
- **Interactive UI.** A [web-based interface](https://nhsengland.github.io/evalsense/docs/#web-based-ui) enables rapid experimentation with different evaluation workflows without requiring any code.
- **Advanced evaluation methods.** EvalSense incorporates recent LLM-as-a-Judge and hybrid [evaluation approaches](https://nhsengland.github.io/evalsense/docs/api-reference/evaluation/evaluators/), such as [G-Eval](https://nhsengland.github.io/evalsense/docs/api-reference/evaluation/evaluators/#evalsense.evaluation.evaluators.GEvalScoreCalculator) and [QAGS](https://nhsengland.github.io/evalsense/docs/api-reference/evaluation/evaluators/#evalsense.evaluation.evaluators.QagsConfig), while also supporting more traditional metrics like [BERTScore](https://nhsengland.github.io/evalsense/docs/api-reference/evaluation/evaluators/#evalsense.evaluation.evaluators.BertScoreCalculator) and [ROUGE](https://nhsengland.github.io/evalsense/docs/api-reference/evaluation/evaluators/#evalsense.evaluation.evaluators.RougeScoreCalculator) (a brief standalone ROUGE example follows this list).
- **Efficient execution.** Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
- **Modularity and extensibility.** Key components and evaluation methods can be used independently or replaced with user-defined implementations.
- **Comprehensive logging.** All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.
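To make the "more traditional metrics" above concrete, the snippet below computes ROUGE directly with the standalone [`rouge_score`](https://github.com/google-research/google-research/tree/master/rouge) package. This is purely illustrative (the example texts are invented); within EvalSense, the same metric is exposed through the `RougeScoreCalculator` evaluator linked above.

```python
# Illustrative standalone ROUGE computation using the rouge_score package
# (not EvalSense's own API); the example texts are invented.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The patient was discharged home on oral antibiotics.",
    prediction="The patient went home with a course of oral antibiotics.",
)
# Each entry is a Score tuple with precision, recall and F-measure fields.
print(scores["rougeL"].fmeasure)
```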
More information about EvalSense can be found on its [homepage](https://nhsengland.github.io/evalsense/) and in its [documentation](https://nhsengland.github.io/evalsense/docs/).
_**Note:** Only public or fake data are shared in this repository._
## Project Structure
- The main code for the EvalSense Python package can be found under [`evalsense/`](https://github.com/nhsengland/evalsense/tree/main/evalsense).
- The accompanying documentation is available in the [`docs/`](https://github.com/nhsengland/evalsense/tree/main/docs) folder.
- Code for the interactive LLM evaluation guide is located under [`guide/`](https://github.com/nhsengland/evalsense/tree/main/guide).
- Jupyter notebooks with the evaluation experiments and examples are located under [`notebooks/`](https://github.com/nhsengland/evalsense/tree/main/notebooks).
## Getting Started
### Installation
You can install the project using [pip](https://pip.pypa.io/en/stable/) by running the following command:
```bash
pip install evalsense
```
This will install the latest released version of the package from [PyPI](https://pypi.org/project/evalsense/).
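If you need reproducible installs, you can also pin an exact release (at the time of writing, the latest version on PyPI is 0.1.4):

```bash
# Pin an exact release for reproducibility
# (0.1.4 is the latest version at the time of writing)
pip install "evalsense==0.1.4"
```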
Depending on your use case, you may want to install additional optional dependencies from the following groups:
- `interactive`: For running experiments interactively in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
- `transformers`: For using models and metrics requiring the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library.
- `vllm`: For using models and metrics requiring [vLLM](https://docs.vllm.ai/en/stable/).
- `local`: For installing all local model dependencies (currently includes `transformers` and `vllm`).
- `all`: For installing all optional dependencies.
For example, if you want to install EvalSense with all optional dependencies, you can run:
```bash
pip install "evalsense[all]"
```
If you want to use EvalSense with Jupyter notebooks (`interactive`) and Hugging Face Transformers (`transformers`), you can run:
```bash
pip install "evalsense[interactive,transformers]"
```
and similarly for other combinations.
### Installation for Development
To install the project for local development, follow the steps below.
First, clone the repository:
`git clone git@github.com:nhsengland/evalsense.git`
To set up the Python environment for the project (a consolidated command sequence is shown after this list):
- Install [uv](https://github.com/astral-sh/uv) if it's not installed already
- `uv sync --all-extras`
- `source .venv/bin/activate`
- `pre-commit install`
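Taken together, a typical setup session looks like this (the `cd evalsense` step assumes the default clone directory name):

```bash
git clone git@github.com:nhsengland/evalsense.git
cd evalsense                 # assumes the default clone directory name
uv sync --all-extras         # create the virtual environment with all extras
source .venv/bin/activate    # activate the environment
pre-commit install           # install the pre-commit hooks
```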
Note that the code is formatted with [ruff](https://github.com/astral-sh/ruff) and type-checked by [pyright](https://github.com/microsoft/pyright) in `standard` type checking mode. For the best development experience, we recommend enabling the corresponding extensions in your preferred code editor.
To set up the Node environment for the LLM evaluation guide (located under [`guide/`](https://github.com/nhsengland/evalsense/tree/main/guide)):
- Install [node](https://nodejs.org/en/download) if it's not installed already
- Change to the `guide/` directory (`cd guide`)
- `npm install`
- `npm run start` to run the development server
See also the separate [README.md](https://github.com/nhsengland/evalsense/tree/main/guide/README.md) for the guide.
### Programmatic Usage
For examples illustrating the usage of EvalSense, please check the notebooks under the `notebooks/` folder (a brief illustrative sketch also follows this list):
- The [Demo notebook](https://github.com/nhsengland/evalsense/blob/main/notebooks/Demo.ipynb) illustrates a basic application of EvalSense to the ACI-Bench dataset.
- The [Experiments notebook](https://github.com/nhsengland/evalsense/blob/main/notebooks/Experiments.ipynb) illustrates more thorough experiments on the same dataset, involving a larger number of evaluators and models.
- The [Meta-Evaluation notebook](https://github.com/nhsengland/evalsense/blob/main/notebooks/Meta-Evaluation.ipynb) focuses on meta-evaluation on synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.
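If you would like a quick feel for the overall shape of a programmatic workflow before opening the notebooks, the sketch below may help. The evaluator class names are taken from the API reference linked above, but their constructor arguments and the commented scoring calls are illustrative assumptions rather than the actual EvalSense API; the Demo notebook shows the real interfaces.

```python
# Hypothetical sketch of a programmatic EvalSense evaluation.
# BertScoreCalculator and RougeScoreCalculator are real class names from the
# API reference, but the default construction and commented calls below are
# assumptions -- see the Demo notebook for the actual API.
from evalsense.evaluation.evaluators import (
    BertScoreCalculator,
    RougeScoreCalculator,
)

prediction = "Follow-up bloods were arranged and the dose was reduced."
reference = "The dose was lowered and follow-up blood tests were booked."

rouge = RougeScoreCalculator()      # assumed default construction
bertscore = BertScoreCalculator()   # assumed default construction

# Hypothetical scoring calls; real method names and signatures may differ:
# rouge_result = rouge.calculate(prediction=prediction, reference=reference)
# bert_result = bertscore.calculate(prediction=prediction, reference=reference)
```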
### Web-Based UI
To use the interactive web-based UI implemented in EvalSense, simply run
```bash
evalsense webui
```
after installing the package and its dependencies.
## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b amazing-feature`)
3. Commit your Changes (`git commit -m 'Add some amazing feature'`)
4. Push to the Branch (`git push origin amazing-feature`)
5. Open a Pull Request
_See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed guidance._
## Licence
Unless stated otherwise, the codebase is released under [the MIT Licence][mit].
This covers both the codebase and any sample code in the documentation.
_See [LICENCE](./LICENCE) for more information._
The documentation is [© Crown copyright][copyright] and available under the terms
of the [Open Government Licence v3.0][ogl].
[mit]: LICENCE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
[ogl]: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
### Contact
This project is currently maintained by [@adamdejl](https://github.com/adamdejl). If you have any questions, suggestions for new features or want to report a bug, please [open an issue](https://github.com/nhsengland/evalsense/issues/new/choose). For security concerns, please file a [private vulnerability report](https://github.com/nhsengland/evalsense/security/advisories/new).
To find out more about [NHS England Data Science](https://nhsengland.github.io/datascience/), visit our [project website](https://nhsengland.github.io/datascience/our_work/) or get in touch at [datascience@nhs.net](mailto:datascience@nhs.net).
### Acknowledgements
We thank the [Inspect AI development team](https://github.com/UKGovernmentBEIS/inspect_ai/graphs/contributors) for their work on the [Inspect AI library](https://inspect.aisi.org.uk/), which serves as a basis for EvalSense.